Merge remote-tracking branch 'upstream/master' into release-1.6

reviewable/pr2926/r1
Devin Donnelly 2017-03-20 13:51:51 -07:00
commit 4956e43fb1
131 changed files with 7647 additions and 6853 deletions

View File

@ -3,45 +3,63 @@ abstract: "Detailed explanations of Kubernetes system concepts and abstractions.
toc:
- docs/concepts/index.md
- title: Kubectl Command Line
- title: Overview
section:
- docs/concepts/tools/kubectl/object-management-overview.md
- docs/concepts/tools/kubectl/object-management-using-imperative-commands.md
- docs/concepts/tools/kubectl/object-management-using-imperative-config.md
- docs/concepts/tools/kubectl/object-management-using-declarative-config.md
- docs/concepts/overview/what-is-kubernetes.md
- docs/concepts/overview/components.md
- title: Working with Kubernetes Objects
section:
- docs/concepts/overview/working-with-objects/kubernetes-objects.md
- docs/concepts/overview/working-with-objects/labels.md
- docs/concepts/overview/working-with-objects/annotations.md
- docs/concepts/overview/kubernetes-api.md
- title: Kubernetes Objects
section:
- docs/concepts/abstractions/overview.md
- title: Pods
section:
- docs/concepts/abstractions/pod.md
- docs/concepts/abstractions/init-containers.md
- title: Controllers
section:
- docs/concepts/abstractions/controllers/statefulsets.md
- docs/concepts/abstractions/controllers/petsets.md
- docs/concepts/abstractions/controllers/garbage-collection.md
- title: Object Metadata
section:
- docs/concepts/object-metadata/annotations.md
- title: Workloads
section:
- title: Pods
section:
- docs/concepts/workloads/pods/pod-overview.md
- docs/concepts/workloads/pods/pod-lifecycle.md
- docs/concepts/workloads/pods/init-containers.md
- title: Jobs
section:
- docs/concepts/jobs/run-to-completion-finite-workloads.md
- title: Clusters
- title: Cluster Administration
section:
- docs/concepts/clusters/logging.md
- docs/concepts/cluster-administration/manage-deployment.md
- docs/concepts/cluster-administration/networking.md
- docs/concepts/cluster-administration/network-plugins.md
- docs/concepts/cluster-administration/logging.md
- docs/concepts/cluster-administration/audit.md
- docs/concepts/cluster-administration/out-of-resource.md
- docs/concepts/cluster-administration/multiple-clusters.md
- docs/concepts/cluster-administration/federation.md
- docs/concepts/cluster-administration/guaranteed-scheduling-critical-addon-pods.md
- docs/concepts/cluster-administration/static-pod.md
- docs/concepts/cluster-administration/sysctl-cluster.md
- title: Services, Load Balancing, and Networking
section:
- docs/concepts/services-networking/dns-pod-service.md
- title: Configuration
section:
- docs/concepts/configuration/overview.md
- docs/concepts/configuration/container-command-args.md
- docs/concepts/configuration/manage-compute-resources-container.md
- title: Policies
section:
- docs/concepts/policy/container-capabilities.md
- docs/concepts/policy/resource-quotas.md

View File

@ -170,6 +170,7 @@ toc:
section:
- docs/admin/index.md
- docs/admin/cluster-management.md
- docs/admin/upgrade-1-6.md
- docs/admin/kubeadm.md
- docs/admin/addons.md
- docs/admin/node-allocatable.md

View File

@ -13,6 +13,8 @@ toc:
- docs/tasks/configure-pod-container/define-environment-variable-container.md
- docs/tasks/configure-pod-container/define-command-argument-container.md
- docs/tasks/configure-pod-container/assign-cpu-ram-container.md
- docs/tasks/configure-pod-container/limit-range.md
- docs/tasks/configure-pod-container/apply-resource-quota-limit.md
- docs/tasks/configure-pod-container/configure-volume-storage.md
- docs/tasks/configure-pod-container/configure-persistent-volume-storage.md
- docs/tasks/configure-pod-container/environment-variable-expose-pod-information.md
@ -23,6 +25,17 @@ toc:
- docs/tasks/configure-pod-container/communicate-containers-same-pod.md
- docs/tasks/configure-pod-container/configure-pod-initialization.md
- docs/tasks/configure-pod-container/attach-handler-lifecycle-event.md
- docs/tasks/configure-pod-container/configure-pod-disruption-budget.md
- title: Running Applications
section:
- docs/tasks/run-application/rolling-update-replication-controller.md
- title: Running Jobs
section:
- docs/tasks/job/parallel-processing-expansion.md
- docs/tasks/job/work-queue-1/index.md
- docs/tasks/job/fine-parallel-processing-work-queue/index.md
- title: Accessing Applications in a Cluster
section:
@ -34,6 +47,7 @@ toc:
- docs/tasks/debug-application-cluster/determine-reason-pod-failure.md
- docs/tasks/debug-application-cluster/debug-init-containers.md
- docs/tasks/debug-application-cluster/logging-stackdriver.md
- docs/tasks/debug-application-cluster/monitor-node-health.md
- docs/tasks/debug-application-cluster/logging-elasticsearch-kibana.md
- title: Accessing the Kubernetes API
@ -46,6 +60,18 @@ toc:
- docs/tasks/administer-cluster/dns-horizontal-autoscaling.md
- docs/tasks/administer-cluster/safely-drain-node.md
- docs/tasks/administer-cluster/change-pv-reclaim-policy.md
- docs/tasks/administer-cluster/limit-storage-consumption.md
- title: Administering Federation
section:
- docs/tasks/administer-federation/configmap.md
- docs/tasks/administer-federation/daemonset.md
- docs/tasks/administer-federation/deployment.md
- docs/tasks/administer-federation/events.md
- docs/tasks/administer-federation/ingress.md
- docs/tasks/administer-federation/namespaces.md
- docs/tasks/administer-federation/replicaset.md
- docs/tasks/administer-federation/secret.md
- title: Managing Stateful Applications
section:

View File

@ -32,11 +32,18 @@ toc:
- title: Online Training Course
path: https://www.udacity.com/course/scalable-microservices-with-kubernetes--ud615
- docs/tutorials/stateless-application/hello-minikube.md
- title: Object Management Using kubectl
section:
- docs/tutorials/object-management-kubectl/object-management.md
- docs/tutorials/object-management-kubectl/imperative-object-management-command.md
- docs/tutorials/object-management-kubectl/imperative-object-management-configuration.md
- docs/tutorials/object-management-kubectl/declarative-object-management-configuration.md
- title: Stateless Applications
section:
- docs/tutorials/stateless-application/run-stateless-application-deployment.md
- docs/tutorials/stateless-application/expose-external-ip-address-service.md
- docs/tutorials/stateless-application/expose-external-ip-address.md
- docs/tutorials/stateless-application/run-stateless-ap-replication-controller.md
- title: Stateful Applications
section:
- docs/tutorials/stateful-application/basic-stateful-set.md
@ -46,6 +53,12 @@ toc:
- title: Connecting Applications
section:
- docs/tutorials/connecting-apps/connecting-frontend-backend.md
- title: Clusters
section:
- docs/tutorials/clusters/apparmor.md
- title: Services
section:
- docs/tutorials/services/source-ip.md
- title: Federated Cluster Administration
section:
- docs/tutorials/federation/set-up-cluster-federation-kubefed.md

View File

@ -0,0 +1,12 @@
<table style="background-color:#eeeeee">
<tr>
<td>
<p><b>NOTICE</b></p>
<p>As of March 14, 2017, the <a href="https://github.com/orgs/kubernetes/teams/sig-docs-maintainers">@kubernetes/sig-docs-maintainers</a> have begun migration of the User Guide content as announced previously to the <a href="https://github.com/kubernetes/community/tree/master/sig-docs">SIG Docs community</a> through the <a href="https://groups.google.com/forum/#!forum/kubernetes-sig-docs">kubernetes-sig-docs</a> group and <a href="https://kubernetes.slack.com/messages/sig-docs/">kubernetes.slack.com #sig-docs</a> channel.</p>
<p>The user guides within this section are being refactored into topics within Tutorials, Tasks, and Concepts. Anything that has been moved will have a notice placed in its previous location as well as a link to its new location. The reorganization implements the table of contents (ToC) outlined in the <a href="https://docs.google.com/a/google.com/document/d/18hRCIorVarExB2eBVHTUR6eEJ2VVk5xq1iBmkQv8O6I/edit?usp=sharing">kubernetes-docs-toc</a> document and should improve the documentation's findability and readability for a wider range of audiences.</p>
<p>For any questions, please contact: <a href="mailto:kubernetes-sig-docs@googlegroups.com">kubernetes-sig-docs@googlegroups.com</a></p>
</td>
</tr>
</table>

View File

@ -1280,7 +1280,7 @@ $feature-box-div-margin-bottom: 40px
background-color: $white
box-shadow: 0 5px 5px rgba(0,0,0,.24),0 0 5px rgba(0,0,0,.12)
#calendarWrapper
#calendarMeetings
position: relative
width: 80vw
height: 60vw
@ -1288,6 +1288,14 @@ $feature-box-div-margin-bottom: 40px
max-height: 900px
margin: 20px auto
#calendarEvents
position: relative
width: 80vw
height: 30vw
max-width: 1200px
max-height: 450px
margin: 20px auto
iframe
position: absolute
border: 0

View File

@ -30,14 +30,14 @@ cid: community
<p>As a member of the Kubernetes community, you are welcome to join any of the SIG meetings
you are interested in. No registration required.</p>
<div id="calendarWrapper">
<div id="calendarMeetings">
<iframe src="https://calendar.google.com/calendar/embed?src=cgnt364vd8s86hr2phapfjc6uk%40group.calendar.google.com&ctz=America/Los_Angeles"
frameborder="0" scrolling="no"></iframe>
</div>
</div>
<div class="content">
<h3>Events</h3>
<div id="calendarWrapper">
<div id="calendarEvents">
<iframe src="https://calendar.google.com/calendar/embed?src=nt2tcnbtbied3l6gi2h29slvc0%40group.calendar.google.com&ctz=America/Los_Angeles"
frameborder="0" scrolling="no"></iframe>
</div>

View File

@ -4,389 +4,6 @@ assignees:
title: AppArmor
---
AppArmor is a Linux kernel enhancement that can reduce the potential attack surface of an
application and provide greater defense in depth for applications. Beta support for AppArmor was
added in Kubernetes v1.4.
{% include user-guide-content-moved.md %}
* TOC
{:toc}
## What is AppArmor
AppArmor is a Linux kernel security module that supplements the standard Linux user and group based
permissions to confine programs to a limited set of resources. AppArmor can be configured for any
application to reduce its potential attack surface and provide greater defense in depth. It is
configured through profiles tuned to whitelist the access needed by a specific program or container,
such as Linux capabilities, network access, file permissions, etc. Each profile can be run in either
enforcing mode, which blocks access to disallowed resources, or complain mode, which only reports
violations.
AppArmor can help you to run a more secure deployment by restricting what containers are allowed to
do, and/or providing better auditing through system logs. However, it is important to keep in mind
that AppArmor is not a silver bullet, and can only do so much to protect against exploits in your
application code. It is important to provide good, restrictive profiles, and harden your
applications and cluster from other angles as well.
AppArmor support in Kubernetes is currently in beta.
## Prerequisites
1. **Kubernetes version is at least v1.4**. Kubernetes support for AppArmor was added in
v1.4. Kubernetes components older than v1.4 are not aware of the new AppArmor annotations, and
will **silently ignore** any AppArmor settings that are provided. To ensure that your Pods are
receiving the expected protections, it is important to verify the Kubelet version of your nodes:
$ kubectl get nodes -o=jsonpath=$'{range .items[*]}{@.metadata.name}: {@.status.nodeInfo.kubeletVersion}\n{end}'
gke-test-default-pool-239f5d02-gyn2: v1.4.0
gke-test-default-pool-239f5d02-x1kf: v1.4.0
gke-test-default-pool-239f5d02-xwux: v1.4.0
2. **AppArmor kernel module is enabled**. For the Linux kernel to enforce an AppArmor profile, the
AppArmor kernel module must be installed and enabled. Several distributions enable the module by
default, such as Ubuntu and SUSE, and many others provide optional support. To check whether the
module is enabled, check the `/sys/module/apparmor/parameters/enabled` file:
$ cat /sys/module/apparmor/parameters/enabled
Y
If the Kubelet contains AppArmor support (>= v1.4), it will refuse to run a Pod with AppArmor
options if the kernel module is not enabled.
*Note: Ubuntu carries many AppArmor patches that have not been merged into the upstream Linux
kernel, including patches that add additional hooks and features. Kubernetes has only been
tested with the upstream version, and does not promise support for other features.*
3. **Container runtime is Docker**. Currently the only Kubernetes-supported container runtime that
also supports AppArmor is Docker. As more runtimes add AppArmor support, the options will be
expanded. You can verify that your nodes are running docker with:
$ kubectl get nodes -o=jsonpath=$'{range .items[*]}{@.metadata.name}: {@.status.nodeInfo.containerRuntimeVersion}\n{end}'
gke-test-default-pool-239f5d02-gyn2: docker://1.11.2
gke-test-default-pool-239f5d02-x1kf: docker://1.11.2
gke-test-default-pool-239f5d02-xwux: docker://1.11.2
If the Kubelet contains AppArmor support (>= v1.4), it will refuse to run a Pod with AppArmor
options if the runtime is not Docker.
4. **Profile is loaded**. AppArmor is applied to a Pod by specifying an AppArmor profile that each
container should be run with. If any of the specified profiles is not already loaded in the
kernel, the Kubelet (>= v1.4) will reject the Pod. You can view which profiles are loaded on a
node by checking the `/sys/kernel/security/apparmor/profiles` file. For example:
$ ssh gke-test-default-pool-239f5d02-gyn2 "sudo cat /sys/kernel/security/apparmor/profiles | sort"
apparmor-test-deny-write (enforce)
apparmor-test-audit-write (enforce)
docker-default (enforce)
k8s-nginx (enforce)
For more details on loading profiles on nodes, see
[Setting up nodes with profiles](#setting-up-nodes-with-profiles).
As long as the Kubelet version includes AppArmor support (>= v1.4), the Kubelet will reject a Pod
with AppArmor options if any of the prerequisites are not met. You can also verify AppArmor support
on nodes by checking the node ready condition message (though this is likely to be removed in a
later release):
$ kubectl get nodes -o=jsonpath=$'{range .items[*]}{@.metadata.name}: {.status.conditions[?(@.reason=="KubeletReady")].message}\n{end}'
gke-test-default-pool-239f5d02-gyn2: kubelet is posting ready status. AppArmor enabled
gke-test-default-pool-239f5d02-x1kf: kubelet is posting ready status. AppArmor enabled
gke-test-default-pool-239f5d02-xwux: kubelet is posting ready status. AppArmor enabled
## Securing a Pod
*Note: AppArmor is currently in beta, so options are specified as annotations. Once support graduates to
general availability, the annotations will be replaced with first-class fields (more details in
[Upgrade path to GA](#upgrade-path-to-general-availability)).*
AppArmor profiles are specified *per-container*. To specify the AppArmor profile to run a Pod
container with, add an annotation to the Pod's metadata:
container.apparmor.security.beta.kubernetes.io/<container_name>: <profile_ref>
Where `<container_name>` is the name of the container to apply the profile to, and `<profile_ref>`
specifies the profile to apply. The `profile_ref` can be one of:
- `runtime/default` to apply the runtime's default profile.
- `localhost/<profile_name>` to apply the profile loaded on the host with the name `<profile_name>`
See the [API Reference](#api-reference) for the full details on the annotation and profile name formats.
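For illustration, here is a minimal (hypothetical) Pod that applies the runtime's default profile to a container named `hello`; a fuller example using a `localhost/` profile appears in the [Example](#example) section below:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hello-apparmor-default   # hypothetical name
  annotations:
    # Run the "hello" container with the container runtime's default profile.
    container.apparmor.security.beta.kubernetes.io/hello: runtime/default
spec:
  containers:
  - name: hello
    image: busybox
    command: [ "sh", "-c", "echo 'Hello AppArmor!' && sleep 1h" ]
```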
The Kubernetes AppArmor enforcement works by first checking that all the prerequisites have been
met, and then forwarding the profile selection to the container runtime for enforcement. If the
prerequisites have not been met, the Pod will be rejected, and will not run.
To verify that the profile was applied, you can expect to see the AppArmor security option listed in the container created event:
$ kubectl get events | grep Created
22s 22s 1 hello-apparmor Pod spec.containers{hello} Normal Created {kubelet e2e-test-stclair-minion-group-31nt} Created container with docker id 269a53b202d3; Security:[seccomp=unconfined apparmor=k8s-apparmor-example-deny-write]
You can also verify directly that the container's root process is running with the correct profile by checking its proc attr:
$ kubectl exec <pod_name> cat /proc/1/attr/current
k8s-apparmor-example-deny-write (enforce)
## Example
In this example you'll see:
- One way to load a profile on a node
- How to enforce the profile on a Pod
- How to check that the profile is loaded
- What happens when a profile is violated
- What happens when a profile cannot be loaded
*This example assumes you have already set up a cluster with AppArmor support.*
First, we need to load the profile we want to use onto our nodes. The profile we'll use simply
denies all file writes:
{% include code.html language="text" file="deny-write.profile" ghlink="/docs/admin/apparmor/deny-write.profile" %}
Since we don't know where the Pod will be scheduled, we'll need to load the profile on all our
nodes. For this example we'll just use SSH to install the profiles, but other approaches are
discussed in [Setting up nodes with profiles](#setting-up-nodes-with-profiles).
$ NODES=(
# The SSH-accessible domain names of your nodes
gke-test-default-pool-239f5d02-gyn2.us-central1-a.my-k8s
gke-test-default-pool-239f5d02-x1kf.us-central1-a.my-k8s
gke-test-default-pool-239f5d02-xwux.us-central1-a.my-k8s)
$ for NODE in ${NODES[*]}; do ssh $NODE 'sudo apparmor_parser -q <<EOF
#include <tunables/global>
profile k8s-apparmor-example-deny-write flags=(attach_disconnected) {
#include <abstractions/base>
file,
# Deny all file writes.
deny /** w,
}
EOF'
done
Next, we'll run a simple "Hello AppArmor" pod with the deny-write profile:
{% include code.html language="yaml" file="hello-apparmor-pod.yaml" ghlink="/docs/admin/apparmor/hello-apparmor-pod.yaml" %}
$ kubectl create -f /dev/stdin <<EOF
apiVersion: v1
kind: Pod
metadata:
name: hello-apparmor
annotations:
container.apparmor.security.beta.kubernetes.io/hello: localhost/k8s-apparmor-example-deny-write
spec:
containers:
- name: hello
image: busybox
command: [ "sh", "-c", "echo 'Hello AppArmor!' && sleep 1h" ]
EOF
pod "hello-apparmor" created
If we look at the pod events, we can see that the Pod container was created with the AppArmor
profile "k8s-apparmor-example-deny-write":
$ kubectl get events | grep hello-apparmor
14s 14s 1 hello-apparmor Pod Normal Scheduled {default-scheduler } Successfully assigned hello-apparmor to gke-test-default-pool-239f5d02-gyn2
14s 14s 1 hello-apparmor Pod spec.containers{hello} Normal Pulling {kubelet gke-test-default-pool-239f5d02-gyn2} pulling image "busybox"
13s 13s 1 hello-apparmor Pod spec.containers{hello} Normal Pulled {kubelet gke-test-default-pool-239f5d02-gyn2} Successfully pulled image "busybox"
13s 13s 1 hello-apparmor Pod spec.containers{hello} Normal Created {kubelet gke-test-default-pool-239f5d02-gyn2} Created container with docker id 06b6cd1c0989; Security:[seccomp=unconfined apparmor=k8s-apparmor-example-deny-write]
13s 13s 1 hello-apparmor Pod spec.containers{hello} Normal Started {kubelet gke-test-default-pool-239f5d02-gyn2} Started container with docker id 06b6cd1c0989
We can verify that the container is actually running with that profile by checking its proc attr:
$ kubectl exec hello-apparmor cat /proc/1/attr/current
k8s-apparmor-example-deny-write (enforce)
Finally, we can see what happens if we try to violate the profile by writing to a file:
$ kubectl exec hello-apparmor touch /tmp/test
touch: /tmp/test: Permission denied
error: error executing remote command: command terminated with non-zero exit code: Error executing in Docker Container: 1
To wrap up, let's look at what happens if we try to specify a profile that hasn't been loaded:
$ kubectl create -f /dev/stdin <<EOF
apiVersion: v1
kind: Pod
metadata:
name: hello-apparmor-2
annotations:
container.apparmor.security.beta.kubernetes.io/hello: localhost/k8s-apparmor-example-allow-write
spec:
containers:
- name: hello
image: busybox
command: [ "sh", "-c", "echo 'Hello AppArmor!' && sleep 1h" ]
EOF
pod "hello-apparmor-2" created
$ kubectl describe pod hello-apparmor-2
Name: hello-apparmor-2
Namespace: default
Node: gke-test-default-pool-239f5d02-x1kf/
Start Time: Tue, 30 Aug 2016 17:58:56 -0700
Labels: <none>
Status: Failed
Reason: AppArmor
Message: Pod Cannot enforce AppArmor: profile "k8s-apparmor-example-allow-write" is not loaded
IP:
Controllers: <none>
Containers:
hello:
Image: busybox
Port:
Command:
sh
-c
echo 'Hello AppArmor!' && sleep 1h
Requests:
cpu: 100m
Environment Variables: <none>
Volumes:
default-token-dnz7v:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-dnz7v
QoS Tier: Burstable
Events:
FirstSeen LastSeen Count From SubobjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
23s 23s 1 {default-scheduler } Normal Scheduled Successfully assigned hello-apparmor-2 to e2e-test-stclair-minion-group-t1f5
23s 23s 1 {kubelet e2e-test-stclair-minion-group-t1f5} Warning AppArmor Cannot enforce AppArmor: profile "k8s-apparmor-example-allow-write" is not loaded
Note the pod status is Failed, with a helpful error message: `Pod Cannot enforce AppArmor: profile
"k8s-apparmor-example-allow-write" is not loaded`. An event was also recorded with the same message.
## Administration
### Setting up nodes with profiles
Kubernetes does not currently provide any native mechanisms for loading AppArmor profiles onto
nodes. There are many ways to set up the profiles, though, such as:
- Through a [DaemonSet](../daemons/) that runs a Pod on each node to
ensure the correct profiles are loaded. An example implementation can be found
[here](https://github.com/kubernetes/contrib/tree/master/apparmor/loader).
- At node initialization time, using your node initialization scripts (e.g. Salt, Ansible, etc.) or
image.
- By copying the profiles to each node and loading them through SSH, as demonstrated in the
[Example](#example).
The scheduler is not aware of which profiles are loaded onto which node, so the full set of profiles
must be loaded onto every node. An alternative approach is to add a node label for each profile (or
class of profiles) on the node, and use a
[node selector](../../user-guide/node-selection/) to ensure the Pod is run on a
node with the required profile.
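As a rough sketch of that node-label approach (the label key/value and profile name here are hypothetical, not defined by Kubernetes):
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: constrained-pod          # hypothetical name
  annotations:
    container.apparmor.security.beta.kubernetes.io/app: localhost/k8s-apparmor-example-deny-write
spec:
  # Only schedule onto nodes labeled as having the profile loaded.
  nodeSelector:
    apparmor-profile-deny-write: loaded   # hypothetical label
  containers:
  - name: app
    image: busybox
    command: [ "sh", "-c", "sleep 1h" ]
```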
### Restricting profiles with the PodSecurityPolicy
If the PodSecurityPolicy extension is enabled, cluster-wide AppArmor restrictions can be applied. To
enable the PodSecurityPolicy, two flags must be set on the `apiserver`:
--admission-control=PodSecurityPolicy[,others...]
--runtime-config=extensions/v1beta1/podsecuritypolicy[,others...]
With the extension enabled, the AppArmor options can be specified as annotations on the PodSecurityPolicy:
apparmor.security.beta.kubernetes.io/defaultProfileName: <profile_ref>
apparmor.security.beta.kubernetes.io/allowedProfileNames: <profile_ref>[,others...]
The default profile name option specifies the profile to apply to containers by default when none is
specified. The allowed profile names option specifies a list of profiles that Pod containers are
allowed to be run with. If both options are provided, the default must be allowed. The profiles are
specified in the same format as on containers. See the [API Reference](#api-reference) for the full
specification.
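A hedged sketch of what such a policy object might look like (assuming the `extensions/v1beta1` PodSecurityPolicy API of this release; the non-AppArmor `spec` fields shown are only the minimum the object requires and are unrelated to this page):
```yaml
apiVersion: extensions/v1beta1
kind: PodSecurityPolicy
metadata:
  name: apparmor-restricted      # hypothetical name
  annotations:
    # Profile applied when a Pod container does not specify one.
    apparmor.security.beta.kubernetes.io/defaultProfileName: runtime/default
    # Profiles containers are allowed to request; the default must be included.
    apparmor.security.beta.kubernetes.io/allowedProfileNames: runtime/default,localhost/k8s-apparmor-example-deny-write
spec:
  seLinux:
    rule: RunAsAny
  runAsUser:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
```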
### Disabling AppArmor
If you do not want AppArmor to be available on your cluster, it can be disabled by a command-line flag:
--feature-gates=AppArmor=false
When disabled, any Pod that includes an AppArmor profile will fail validation with a "Forbidden"
error. Note that by default docker always enables the "docker-default" profile on non-privileged
pods (if the AppArmor kernel module is enabled), and will continue to do so even if the feature-gate
is disabled. The option to disable AppArmor will be removed when AppArmor graduates to general
availability (GA).
### Upgrading to Kubernetes v1.4 with AppArmor
No action is required with respect to AppArmor to upgrade your cluster to v1.4. However, if any
existing pods had an AppArmor annotation, they will not go through validation (or PodSecurityPolicy
admission). If permissive profiles are loaded on the nodes, a malicious user could pre-apply a
permissive profile to escalate the pod privileges above the docker-default. If this is a concern, it
is recommended to scrub the cluster of any pods containing an annotation with
`apparmor.security.beta.kubernetes.io`.
### Upgrade path to General Availability
When AppArmor is ready to be graduated to general availability (GA), the options currently specified
through annotations will be converted to fields. Supporting all the upgrade and downgrade paths
through the transition is very nuanced, and will be explained in detail when the transition
occurs. We will commit to supporting both fields and annotations for at least 2 releases, and will
explicitly reject the annotations for at least 2 releases after that.
## Authoring Profiles
Getting AppArmor profiles specified correctly can be a tricky business. Fortunately there are some
tools to help with that:
- `aa-genprof` and `aa-logprof` generate profile rules by monitoring an application's activity and
logs, and admitting the actions it takes. Further instructions are provided by the
[AppArmor documentation](http://wiki.apparmor.net/index.php/Profiling_with_tools).
- [bane](https://github.com/jfrazelle/bane) is an AppArmor profile generator for Docker that uses a
simplified profile language.
It is recommended to run your application through Docker on a development workstation to generate
the profiles, but there is nothing preventing running the tools on the Kubernetes node where your
Pod is running.
To debug problems with AppArmor, you can check the system logs to see what, specifically, was
denied. AppArmor logs verbose messages to `dmesg`, and errors can usually be found in the system
logs or through `journalctl`. More information is provided in
[AppArmor failures](http://wiki.apparmor.net/index.php/AppArmor_Failures).
Additional resources:
- [Quick guide to the AppArmor profile language](http://wiki.apparmor.net/index.php/QuickProfileLanguage)
- [AppArmor core policy reference](http://wiki.apparmor.net/index.php/ProfileLanguage)
## API Reference
**Pod Annotation**:
Specifying the profile a container will run with:
- **key**: `container.apparmor.security.beta.kubernetes.io/<container_name>`
Where `<container_name>` matches the name of a container in the Pod.
A separate profile can be specified for each container in the Pod.
- **value**: a profile reference, described below
**Profile Reference**:
- `runtime/default`: Refers to the default runtime profile.
- Equivalent to not specifying a profile (without a PodSecurityPolicy default), except it still
requires AppArmor to be enabled.
- For Docker, this resolves to the
[`docker-default`](https://docs.docker.com/engine/security/apparmor/) profile for non-privileged
containers, and unconfined (no profile) for privileged containers.
- `localhost/<profile_name>`: Refers to a profile loaded on the node (localhost) by name.
- The possible profile names are detailed in the
[core policy reference](http://wiki.apparmor.net/index.php/AppArmor_Core_Policy_Reference#Profile_names_and_attachment_specifications)
Any other profile reference format is invalid.
**PodSecurityPolicy Annotations**
Specifying the default profile to apply to containers when none is provided:
- **key**: `apparmor.security.beta.kubernetes.io/defaultProfileName`
- **value**: a profile reference, described above
Specifying the list of profiles that Pod containers are allowed to specify:
- **key**: `apparmor.security.beta.kubernetes.io/allowedProfileNames`
- **value**: a comma-separated list of profile references (described above)
- Although an escaped comma is a legal character in a profile name, it cannot be explicitly
allowed here
[AppArmor](/docs/tutorials/clusters/apparmor/)

View File

@ -5,63 +5,6 @@ assignees:
title: Audit in Kubernetes
---
* TOC
{:toc}
{% include user-guide-content-moved.md %}
Kubernetes Audit provides a security-relevant, chronological set of records documenting
the sequence of activities that have affected the system, whether performed by individual users,
administrators, or other components of the system. It allows the cluster administrator to
answer the following questions:
- what happened?
- when did it happen?
- who initiated it?
- on what did it happen?
- where was it observed?
- from where was it initiated?
- to where was it going?
NOTE: Currently, Kubernetes provides only basic audit capabilities; there is still a lot
of work going on to provide fully featured auditing (see [this issue](https://github.com/kubernetes/features/issues/22)).
Kubernetes audit is part of the [kube-apiserver](/docs/admin/kube-apiserver), which logs all requests
coming to the server. Each audited request produces two log entries:
1. The request line, containing:
- a unique id, used to match the corresponding response line (see 2)
- the source IP of the request
- the HTTP method being invoked
- the original user invoking the operation
- the impersonated user for the operation
- the namespace of the request, or <none>
- the URI as requested
2. The response line, containing:
- the unique id from 1
- the response code
Example output for user `admin` asking for a list of pods:
```
2016-09-07T13:03:57.400333046Z AUDIT: id="5c3b8227-4af9-4322-8a71-542231c3887b" ip="127.0.0.1" method="GET" user="admin" as="<self>" namespace="default" uri="/api/v1/namespaces/default/pods"
2016-09-07T13:03:57.400710987Z AUDIT: id="5c3b8227-4af9-4322-8a71-542231c3887b" response="200"
```
NOTE: The audit capabilities are available *only* for the secured endpoint of the API server.
## Configuration
[Kube-apiserver](/docs/admin/kube-apiserver) provides the following options, which configure
where and how audit logs are handled:
- `audit-log-path` - enables the audit log, pointing to the file where requests are logged.
- `audit-log-maxage` - specifies the maximum number of days to retain old audit log files based on the timestamp encoded in their filename.
- `audit-log-maxbackup` - specifies the maximum number of old audit log files to retain.
- `audit-log-maxsize` - specifies the maximum size in megabytes of the audit log file before it gets rotated. Defaults to 100MB.
If an audit log file already exists, Kubernetes appends new audit logs to that file.
Otherwise, Kubernetes creates an audit log file at the location you specified in
`audit-log-path`. If the audit log file exceeds the size you specify in `audit-log-maxsize`,
Kubernetes will rename the current log file by appending the current timestamp to
the file name (before the file extension) and create a new audit log file.
Kubernetes may delete old log files when creating a new log file; you can configure
how many files are retained and how old they can be by specifying the `audit-log-maxbackup`
and `audit-log-maxage` options.
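As a sketch only, assuming kube-apiserver runs as a static pod (the image tag, log path, and retention values below are illustrative, and all other required apiserver flags are omitted):
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: kube-apiserver
    image: gcr.io/google_containers/kube-apiserver-amd64:v1.5.3   # illustrative tag
    command:
    - kube-apiserver
    - --audit-log-path=/var/log/kube-apiserver-audit.log   # enable audit logging to this file
    - --audit-log-maxage=30       # keep rotated files for at most 30 days
    - --audit-log-maxbackup=10    # keep at most 10 rotated files
    - --audit-log-maxsize=100     # rotate once the file reaches 100 MB (the default)
```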
[Auditing](/docs/concepts/cluster-administration/audit/)

View File

@ -4,133 +4,6 @@ assignees:
title: Kubernetes Components
---
This document outlines the various binary components that need to run to
deliver a functioning Kubernetes cluster.
{% include user-guide-content-moved.md %}
## Master Components
Master components are those that provide the cluster's control plane. For
example, master components are responsible for making global decisions about the
cluster (e.g., scheduling), and detecting and responding to cluster events
(e.g., starting up a new pod when a replication controller's 'replicas' field is
unsatisfied).
In theory, Master components can be run on any node in the cluster. However,
for simplicity, current setup scripts typically start all master components on
the same VM and do not run user containers on that VM. See
[high-availability.md](/docs/admin/high-availability) for an example multi-master-VM setup.
Even in the future, when Kubernetes is fully self-hosting, it will probably be
wise to only allow master components to schedule on a subset of nodes, to limit
co-running with user-run pods, reducing the possible scope of a
node-compromising security exploit.
### kube-apiserver
[kube-apiserver](/docs/admin/kube-apiserver) exposes the Kubernetes API; it is the front-end for the
Kubernetes control plane. It is designed to scale horizontally, i.e., it scales by
running more instances (see [high-availability.md](/docs/admin/high-availability)).
### etcd
[etcd](/docs/admin/etcd) is used as Kubernetes' backing store. All cluster data is stored here.
Proper administration of a Kubernetes cluster includes a backup plan for etcd's
data.
### kube-controller-manager
[kube-controller-manager](/docs/admin/kube-controller-manager) is a binary that runs controllers, which are the
background threads that handle routine tasks in the cluster. Logically, each
controller is a separate process, but to reduce the number of moving pieces in
the system, they are all compiled into a single binary and run in a single
process.
These controllers include:
* Node Controller: Responsible for noticing & responding when nodes go down.
* Replication Controller: Responsible for maintaining the correct number of pods for every replication
controller object in the system.
* Endpoints Controller: Populates the Endpoints object (i.e., join Services & Pods).
* Service Account & Token Controllers: Create default accounts and API access tokens for new namespaces.
* ... and others.
### kube-scheduler
[kube-scheduler](/docs/admin/kube-scheduler) watches newly created pods that have no node assigned, and
selects a node for them to run on.
### addons
Addons are pods and services that implement cluster features. The pods may be managed
by Deployments, ReplicationControllers, etc. Namespaced addon objects are created in
the "kube-system" namespace.
The addon manager is responsible for creating and maintaining addon resources.
See [here](http://releases.k8s.io/HEAD/cluster/addons) for more details.
#### DNS
While the other addons are not strictly required, all Kubernetes
clusters should have [cluster DNS](/docs/admin/dns/), as many examples rely on it.
Cluster DNS is a DNS server, in addition to the other DNS server(s) in your
environment, which serves DNS records for Kubernetes services.
Containers started by Kubernetes automatically include this DNS server
in their DNS searches.
#### User interface
The kube-ui provides a read-only overview of the cluster state. Access
[the UI using kubectl proxy](/docs/user-guide/connecting-to-applications-proxy/#connecting-to-the-kube-ui-service-from-your-local-workstation)
#### Container Resource Monitoring
[Container Resource Monitoring](/docs/user-guide/monitoring) records generic time-series metrics
about containers in a central database, and provides a UI for browsing that data.
#### Cluster-level Logging
A [Cluster-level logging](/docs/user-guide/logging/overview) mechanism is responsible for
saving container logs to a central log store with search/browsing interface.
## Node components
Node components run on every node, maintaining running pods and providing them
the Kubernetes runtime environment.
### kubelet
[kubelet](/docs/admin/kubelet) is the primary node agent. It:
* Watches for pods that have been assigned to its node (either by apiserver
or via local configuration file) and:
* Mounts the pod's required volumes
* Downloads the pod's secrets
* Runs the pod's containers via docker (or, experimentally, rkt).
* Periodically executes any requested container liveness probes.
* Reports the status of the pod back to the rest of the system, by creating a
"mirror pod" if necessary.
* Reports the status of the node back to the rest of the system.
### kube-proxy
[kube-proxy](/docs/admin/kube-proxy) enables the Kubernetes service abstraction by maintaining
network rules on the host and performing connection forwarding.
### docker
`docker` is of course used for actually running containers.
### rkt
`rkt` is supported experimentally as an alternative to docker.
### supervisord
`supervisord` is a lightweight process babysitting system for keeping kubelet and docker
running.
### fluentd
`fluentd` is a daemon which helps provide [cluster-level logging](#cluster-level-logging).
[Kubernetes Components](/docs/concepts/overview/components/)

View File

@ -19,7 +19,9 @@ To install Kubernetes on a set of machines, consult one of the existing [Getting
## Upgrading a cluster
The current state of cluster upgrades is provider dependent.
The current state of cluster upgrades is provider dependent, and some releases may require special care when upgrading. It is recommended that administrators consult both the [release notes](https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG.md) and the version-specific upgrade notes prior to upgrading their clusters.
* [Upgrading to 1.6](/docs/admin/upgrade)
### Upgrading Google Compute Engine clusters
@ -56,8 +58,12 @@ The node upgrade process is user-initiated and is described in the [GKE document
### Upgrading clusters on other platforms
The `cluster/kube-push.sh` script will do a rudimentary update. This process is still quite experimental; we
recommend testing the upgrade on an experimental cluster before performing the update on a production cluster.
Different providers and tools will manage upgrades differently. It is recommended that you consult their main documentation regarding upgrades.
* [kops](https://github.com/kubernetes/kops)
* [kargo](https://github.com/kubernetes-incubator/kargo)
* [CoreOS Tectonic](https://coreos.com/tectonic/docs/latest/admin/upgrade.html)
* ...
## Resizing a cluster

View File

@ -51,7 +51,7 @@ A pod template in a DaemonSet must have a [`RestartPolicy`](/docs/user-guide/pod
### Pod Selector
The `.spec.selector` field is a pod selector. It works the same as the `.spec.selector` of
a [Job](/docs/user-guide/jobs/) or other new resources.
a [Job](/docs/concepts/jobs/run-to-completion-finite-workloads/) or other new resources.
The `spec.selector` is an object consisting of two fields, sketched below:
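As a hedged sketch of those two fields, `matchLabels` and `matchExpressions` (all values are illustrative):
```yaml
selector:
  # Equality-based requirements.
  matchLabels:
    app: my-daemon                 # illustrative label
  # Set-based requirements.
  matchExpressions:
  - key: tier
    operator: In
    values: ["node-agent"]         # illustrative value
```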

View File

@ -3,93 +3,7 @@ assignees:
- davidopp
title: Pod Disruption Budget
---
This guide is for anyone wishing to specify safety constraints on pods or anyone
wishing to write software (typically automation software) that respects those
constraints.
* TOC
{:toc}
{% include user-guide-content-moved.md %}
## Rationale
Various cluster management operations may voluntarily evict pods. "Voluntary"
means an eviction can be safely delayed for a reasonable period of time. The
principal examples today are draining a node for maintenance or upgrade
(`kubectl drain`), and cluster autoscaling down. In the future the
[rescheduler](https://github.com/kubernetes/kubernetes/blob/master/docs/proposals/rescheduling.md)
may also perform voluntary evictions. By contrast, something like evicting pods
because a node has become unreachable or reports `NotReady`, is not "voluntary."
For voluntary evictions, it can be useful for applications to be able to limit
the number of pods that are down simultaneously. For example, a quorum-based application would
like to ensure that the number of replicas running is never brought below the
number needed for a quorum, even temporarily. Or a web front end might want to
ensure that the number of replicas serving load never falls below a certain
percentage of the total, even briefly. `PodDisruptionBudget` is an API object
that specifies the minimum number or percentage of replicas of a collection that
must be up at a time. Components that wish to evict a pod subject to disruption
budget use the `/eviction` subresource; unlike a regular pod deletion, this
operation may be rejected by the API server if the eviction would cause a
disruption budget to be violated.
## Specifying a PodDisruptionBudget
A `PodDisruptionBudget` has two components: a label selector `selector` to specify the set of
pods to which it applies, and `minAvailable` which is a description of the number of pods from that
set that must still be available after the eviction, i.e. even in the absence
of the evicted pod. `minAvailable` can be either an absolute number or a percentage.
So for example, 100% means no voluntary evictions from the set are permitted. In
typical usage, a single budget would be used for a collection of pods managed by
a controller—for example, the pods in a single ReplicaSet.
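A minimal sketch of such a budget (names and values are illustrative), using the `policy/v1beta1` API group that the eviction example below also assumes:
```yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: zk-pdb                     # illustrative name
spec:
  # The set of pods this budget applies to.
  selector:
    matchLabels:
      app: zookeeper
  # At least two of the selected pods (or, e.g., "60%") must remain
  # available during voluntary evictions.
  minAvailable: 2
```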
Note that a disruption budget does not truly guarantee that the specified
number/percentage of pods will always be up. For example, a node that hosts a
pod from the collection may fail when the collection is at the minimum size
specified in the budget, thus bringing the number of available pods from the
collection below the specified size. The budget can only protect against
voluntary evictions, not all causes of unavailability.
## Requesting an eviction
If you are writing infrastructure software that wants to produce these voluntary
evictions, you will need to use the eviction API. The eviction subresource of a
pod can be thought of as a kind of policy-controlled DELETE operation on the pod
itself. To attempt an eviction (perhaps more REST-precisely, to attempt to
*create* an eviction), you POST an attempted operation. Here's an example:
```json
{
"apiVersion": "policy/v1beta1",
"kind": "Eviction",
"metadata": {
"name": "quux",
"namespace": "default"
}
}
```
You can attempt an eviction using `curl`:
```bash
$ curl -v -H 'Content-type: application/json' http://127.0.0.1:8080/api/v1/namespaces/default/pods/quux/eviction -d @eviction.json
```
The API can respond in one of three ways.
1. If the eviction is granted, then the pod is deleted just as if you had sent
a `DELETE` request to the pod's URL and you get back `200 OK`.
2. If the current state of affairs wouldn't allow an eviction by the rules set
forth in the budget, you get back `429 Too Many Requests`. This is
typically used for generic rate limiting of *any* requests, but here we mean
that this request isn't allowed *right now* but it may be allowed later.
Currently, callers do not get any `Retry-After` advice, but they may in
future versions.
3. If there is some kind of misconfiguration, like multiple budgets pointing at
the same pod, you will get `500 Internal Server Error`.
For a given eviction request, there are two cases.
1. There is no budget that matches this pod. In this case, the server always
returns `200 OK`.
2. There is at least one budget. In this case, any of the three above responses may
apply.
[Configuring a Pod Disruption Budget](/docs/tasks/configure-pod-container/configure-pod-disruption-budget/)

View File

@ -5,385 +5,6 @@ assignees:
title: Using DNS Pods and Services
---
## Introduction
{% include user-guide-content-moved.md %}
As of Kubernetes 1.3, DNS is a built-in service launched automatically using the addon manager [cluster add-on](http://releases.k8s.io/{{page.githubbranch}}/cluster/addons/README.md).
Kubernetes DNS schedules a DNS Pod and Service on the cluster, and configures
the kubelets to tell individual containers to use the DNS Service's IP to
resolve DNS names.
## What things get DNS names?
Every Service defined in the cluster (including the DNS server itself) is
assigned a DNS name. By default, a client Pod's DNS search list will
include the Pod's own namespace and the cluster's default domain. This is best
illustrated by example:
Assume a Service named `foo` in the Kubernetes namespace `bar`. A Pod running
in namespace `bar` can look up this service by simply doing a DNS query for
`foo`. A Pod running in namespace `quux` can look up this service by doing a
DNS query for `foo.bar`.
## Supported DNS schema
The following sections detail the supported DNS record types and layout. Any
other layout, names, or queries that happen to work are considered implementation
details and are subject to change without warning.
### Services
#### A records
"Normal" (not headless) Services are assigned a DNS A record for a name of the
form `my-svc.my-namespace.svc.cluster.local`. This resolves to the cluster IP
of the Service.
"Headless" (without a cluster IP) Services are also assigned a DNS A record for
a name of the form `my-svc.my-namespace.svc.cluster.local`. Unlike normal
Services, this resolves to the set of IPs of the pods selected by the Service.
Clients are expected to consume the set or else use standard round-robin
selection from the set.
### SRV records
SRV Records are created for named ports that are part of normal or [Headless
Services](http://releases.k8s.io/docs/user-guide/services/#headless-services).
For each named port, the SRV record would have the form
`_my-port-name._my-port-protocol.my-svc.my-namespace.svc.cluster.local`.
For a regular service, this resolves to the port number and the CNAME:
`my-svc.my-namespace.svc.cluster.local`.
For a headless service, this resolves to multiple answers, one for each pod
that is backing the service, and contains the port number and a CNAME of the pod
of the form `auto-generated-name.my-svc.my-namespace.svc.cluster.local`.
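For instance, a sketch of a Service with a named port (all names here are illustrative) that would be given the SRV record `_metrics._tcp.my-svc.my-namespace.svc.cluster.local`:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-svc
  namespace: my-namespace
spec:
  selector:
    app: my-app                    # illustrative selector
  ports:
  - name: metrics                  # named port: gets its own SRV record
    protocol: TCP
    port: 9100
    targetPort: 9100
```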
### Backwards compatibility
Previous versions of kube-dns made names of the form
`my-svc.my-namespace.cluster.local` (the 'svc' level was added later). This
is no longer supported.
### Pods
#### A Records
When enabled, pods are assigned a DNS A record in the form of `pod-ip-address.my-namespace.pod.cluster.local`.
For example, a pod with IP `1.2.3.4` in the namespace `default` with a DNS name of `cluster.local` would have an entry: `1-2-3-4.default.pod.cluster.local`.
#### A Records and hostname based on Pod's hostname and subdomain fields
Currently when a pod is created, its hostname is the Pod's `metadata.name` value.
With v1.2, users can specify a Pod annotation, `pod.beta.kubernetes.io/hostname`, to specify what the Pod's hostname should be.
If specified, the Pod annotation takes precedence over the Pod's name as the hostname of the pod.
For example, given a Pod with annotation `pod.beta.kubernetes.io/hostname: my-pod-name`, the Pod will have its hostname set to "my-pod-name".
With v1.3, the PodSpec has a `hostname` field, which can be used to specify the Pod's hostname. This field value takes precedence over the
`pod.beta.kubernetes.io/hostname` annotation value.
v1.2 introduces a beta feature where the user can specify a Pod annotation, `pod.beta.kubernetes.io/subdomain`, to specify the Pod's subdomain.
The final domain will be "<hostname>.<subdomain>.<pod namespace>.svc.<cluster domain>".
For example, a Pod with the hostname annotation set to "foo", and the subdomain annotation set to "bar", in namespace "my-namespace", will have the FQDN "foo.bar.my-namespace.svc.cluster.local"
With v1.3, the PodSpec has a `subdomain` field, which can be used to specify the Pod's subdomain. This field value takes precedence over the
`pod.beta.kubernetes.io/subdomain` annotation value.
Example:
```yaml
apiVersion: v1
kind: Service
metadata:
name: default-subdomain
spec:
selector:
name: busybox
clusterIP: None
ports:
- name: foo # Actually, no port is needed.
port: 1234
targetPort: 1234
---
apiVersion: v1
kind: Pod
metadata:
name: busybox1
labels:
name: busybox
spec:
hostname: busybox-1
subdomain: default-subdomain
containers:
- image: busybox
command:
- sleep
- "3600"
name: busybox
---
apiVersion: v1
kind: Pod
metadata:
name: busybox2
labels:
name: busybox
spec:
hostname: busybox-2
subdomain: default-subdomain
containers:
- image: busybox
command:
- sleep
- "3600"
name: busybox
```
If there exists a headless service in the same namespace as the pod and with the same name as the subdomain, the cluster's KubeDNS Server also returns an A record for the Pod's fully qualified hostname.
Given a Pod with the hostname set to "busybox-1" and the subdomain set to "default-subdomain", and a headless Service named "default-subdomain" in the same namespace, the pod will see its own FQDN as "busybox-1.default-subdomain.my-namespace.svc.cluster.local". DNS serves an A record at that name, pointing to the Pod's IP. Both pods "busybox1" and "busybox2" can have their distinct A records.
As of Kubernetes v1.2, the Endpoints object also has the annotation `endpoints.beta.kubernetes.io/hostnames-map`. Its value is the json representation of map[string(IP)][endpoints.HostRecord], for example: '{"10.245.1.6":{HostName: "my-webserver"}}'.
If the Endpoints are for a headless service, an A record is created with the format <hostname>.<service name>.<pod namespace>.svc.<cluster domain>.
For the example json, if endpoints are for a headless service named "bar", and one of the endpoints has IP "10.245.1.6", an A record is created with the name "my-webserver.bar.my-namespace.svc.cluster.local" and the A record lookup would return "10.245.1.6".
This endpoints annotation generally does not need to be specified by end-users, but can be used by the internal service controller to deliver the aforementioned feature.
With v1.3, the Endpoints object can specify the `hostname` for any endpoint, along with its IP. The hostname field takes precedence over the hostname value
that might have been specified via the `endpoints.beta.kubernetes.io/hostnames-map` annotation.
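A rough sketch of that per-endpoint `hostname` field on an Endpoints object (the IP and names are illustrative):
```yaml
apiVersion: v1
kind: Endpoints
metadata:
  name: bar                        # matches the headless Service name
  namespace: my-namespace
subsets:
- addresses:
  - ip: 10.245.1.6
    # Yields the A record my-webserver.bar.my-namespace.svc.cluster.local
    hostname: my-webserver
  ports:
  - port: 80
    protocol: TCP
```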
With v1.3, the following annotations are deprecated: `pod.beta.kubernetes.io/hostname`, `pod.beta.kubernetes.io/subdomain`, `endpoints.beta.kubernetes.io/hostnames-map`
## How do I test if it is working?
### Create a simple Pod to use as a test environment
Create a file named busybox.yaml with the
following contents:
```yaml
apiVersion: v1
kind: Pod
metadata:
name: busybox
namespace: default
spec:
containers:
- image: busybox
command:
- sleep
- "3600"
imagePullPolicy: IfNotPresent
name: busybox
restartPolicy: Always
```
Then create a pod using this file:
```
kubectl create -f busybox.yaml
```
### Wait for this pod to go into the running state
You can get its status with:
```
kubectl get pods busybox
```
You should see:
```
NAME READY STATUS RESTARTS AGE
busybox 1/1 Running 0 <some-time>
```
### Validate that DNS is working
Once that pod is running, you can exec nslookup in that environment:
```
kubectl exec -ti busybox -- nslookup kubernetes.default
```
You should see something like:
```
Server: 10.0.0.10
Address 1: 10.0.0.10
Name: kubernetes.default
Address 1: 10.0.0.1
```
If you see that, DNS is working correctly.
### Troubleshooting Tips
If the nslookup command fails, check the following:
#### Check the local DNS configuration first
Take a look inside the resolv.conf file. (See "Inheriting DNS from the node" and "Known issues" below for more information)
```
kubectl exec busybox cat /etc/resolv.conf
```
Verify that the search path and name server are set up like the following (note that search path may vary for different cloud providers):
```
search default.svc.cluster.local svc.cluster.local cluster.local google.internal c.gce_project_id.internal
nameserver 10.0.0.10
options ndots:5
```
#### Quick diagnosis
Errors such as the following indicate a problem with the kube-dns add-on or associated Services:
```
$ kubectl exec -ti busybox -- nslookup kubernetes.default
Server: 10.0.0.10
Address 1: 10.0.0.10
nslookup: can't resolve 'kubernetes.default'
```
or
```
$ kubectl exec -ti busybox -- nslookup kubernetes.default
Server: 10.0.0.10
Address 1: 10.0.0.10 kube-dns.kube-system.svc.cluster.local
nslookup: can't resolve 'kubernetes.default'
```
#### Check if the DNS pod is running
Use the kubectl get pods command to verify that the DNS pod is running.
```
kubectl get pods --namespace=kube-system -l k8s-app=kube-dns
```
You should see something like:
```
NAME READY STATUS RESTARTS AGE
...
kube-dns-v19-ezo1y 3/3 Running 0 1h
...
```
If you see that no pod is running or that the pod has failed/completed, the DNS add-on may not be deployed by default in your current environment and you will have to deploy it manually.
#### Check for Errors in the DNS pod
Use `kubectl logs` command to see logs for the DNS daemons.
```
kubectl logs --namespace=kube-system $(kubectl get pods --namespace=kube-system -l k8s-app=kube-dns -o name) -c kubedns
kubectl logs --namespace=kube-system $(kubectl get pods --namespace=kube-system -l k8s-app=kube-dns -o name) -c dnsmasq
kubectl logs --namespace=kube-system $(kubectl get pods --namespace=kube-system -l k8s-app=kube-dns -o name) -c healthz
```
Look for any suspicious log entries. A W, E, or F letter at the beginning of a line indicates Warning, Error, or Failure, respectively. Please search for entries with these logging levels and use [kubernetes issues](https://github.com/kubernetes/kubernetes/issues) to report unexpected errors.
#### Is DNS service up?
Verify that the DNS service is up by using the `kubectl get service` command.
```
kubectl get svc --namespace=kube-system
```
You should see:
```
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
...
kube-dns 10.0.0.10 <none> 53/UDP,53/TCP 1h
...
```
If you have created the service, or if it should have been created by default but does not appear, see the [debugging services page](http://kubernetes.io/docs/user-guide/debugging-services/) for more information.
#### Are DNS endpoints exposed?
You can verify that DNS endpoints are exposed by using the `kubectl get endpoints` command.
```
kubectl get ep kube-dns --namespace=kube-system
```
You should see something like:
```
NAME ENDPOINTS AGE
kube-dns 10.180.3.17:53,10.180.3.17:53 1h
```
If you do not see the endpoints, see endpoints section in the [debugging services documentation](http://kubernetes.io/docs/user-guide/debugging-services/).
For additional Kubernetes DNS examples, see the [cluster-dns examples](https://github.com/kubernetes/kubernetes/tree/master/examples/cluster-dns) in the Kubernetes GitHub repository.
## Kubernetes Federation (Multiple Zone support)
Release 1.3 introduced Cluster Federation support for multi-site
Kubernetes installations. This required some minor
(backward-compatible) changes to the way
the Kubernetes cluster DNS server processes DNS queries, to facilitate
the lookup of federated services (which span multiple Kubernetes clusters).
See the [Cluster Federation Administrators' Guide](/docs/admin/federation) for more
details on Cluster Federation and multi-site support.
## How it Works
The running Kubernetes DNS pod holds three containers: kubedns, dnsmasq, and a health check called healthz.
The kubedns process watches the Kubernetes master for changes in Services and Endpoints, and maintains
in-memory lookup structures to service DNS requests. The dnsmasq container adds DNS caching to improve
performance. The healthz container provides a single health check endpoint while performing dual healthchecks
(for dnsmasq and kubedns).
The DNS pod is exposed as a Kubernetes Service with a static IP. Once the IP is assigned, the
kubelet passes the DNS server to each container, configured via the `--cluster-dns=10.0.0.10`
flag.
DNS names also need domains. The local domain is configurable in the kubelet using
the `--cluster-domain=<default local domain>` flag.
The Kubernetes cluster DNS server (based on the [SkyDNS](https://github.com/skynetservices/skydns) library)
supports forward lookups (A records), service lookups (SRV records) and reverse IP address lookups (PTR records).
## Inheriting DNS from the node
When running a pod, kubelet will prepend the cluster DNS server and search
paths to the node's own DNS settings. If the node is able to resolve DNS names
specific to the larger environment, pods should be able to, also. See "Known
issues" below for a caveat.
If you don't want this, or if you want a different DNS config for pods, you can
use the kubelet's `--resolv-conf` flag. Setting it to "" means that pods will
not inherit DNS. Setting it to a valid file path means that kubelet will use
this file instead of `/etc/resolv.conf` for DNS inheritance.
## Known issues
Kubernetes installs do not configure the nodes' resolv.conf files to use the
cluster DNS by default, because that process is inherently distro-specific.
This should probably be implemented eventually.
Linux's libc is impossibly stuck ([see this bug from
2005](https://bugzilla.redhat.com/show_bug.cgi?id=168253)) with limits of just
3 DNS `nameserver` records and 6 DNS `search` records. Kubernetes needs to
consume 1 `nameserver` record and 3 `search` records. This means that if a
local installation already uses 3 `nameserver`s or uses more than 3 `search`es,
some of those settings will be lost. As a partial workaround, the node can run
`dnsmasq` which will provide more `nameserver` entries, but not more `search`
entries. You can also use kubelet's `--resolv-conf` flag.
If you are using Alpine version 3.3 or earlier as your base image, DNS may not
work properly owing to a known issue with Alpine. Check [here](https://github.com/kubernetes/kubernetes/issues/30215)
for more information.
## References
- [Docs for the DNS cluster addon](http://releases.k8s.io/{{page.githubbranch}}/cluster/addons/dns/README.md)
## What's next
- [Autoscaling the DNS Service in a Cluster](/docs/tasks/administer-cluster/dns-horizontal-autoscaling/).
[DNS Pods and Services](/docs/concepts/services-networking/dns-pod-service/)

View File

@ -4,205 +4,6 @@ assignees:
title: Setting up Cluster Federation with Kubefed
---
* TOC
{:toc}
{% include user-guide-content-moved.md %}
Kubernetes version 1.5 includes a new command line tool called
`kubefed` to help you administer your federated clusters.
`kubefed` helps you to deploy a new Kubernetes cluster federation
control plane, and to add clusters to or remove clusters from an
existing federation control plane.
This guide explains how to administer a Kubernetes Cluster Federation
using `kubefed`.
> Note: `kubefed` is an alpha feature in Kubernetes 1.5.
## Prerequisites
This guide assumes that you have a running Kubernetes cluster. Please
see one of the [getting started](/docs/getting-started-guides/) guides
for installation instructions for your platform.
## Getting `kubefed`
Download the client tarball corresponding to Kubernetes version 1.5
or later
[from the release page](https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG.md),
extract the binaries in the tarball to one of the directories
in your `$PATH` and set the executable permission on those binaries.
Note: The URL in the curl command below downloads the binaries for
Linux amd64. If you are on a different platform, please use the URL
for the binaries appropriate for your platform. You can find the list
of available binaries on the [release page](https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG.md#client-binaries-1).
```shell
curl -O https://storage.googleapis.com/kubernetes-release/release/v1.5.2/kubernetes-client-linux-amd64.tar.gz
tar -xzvf kubernetes-client-linux-amd64.tar.gz
sudo cp kubernetes/client/bin/kubefed /usr/local/bin
sudo chmod +x /usr/local/bin/kubefed
sudo cp kubernetes/client/bin/kubectl /usr/local/bin
sudo chmod +x /usr/local/bin/kubectl
```
## Choosing a host cluster.
You'll need to choose one of your Kubernetes clusters to be the
*host cluster*. The host cluster hosts the components that make up
your federation control plane. Ensure that you have a `kubeconfig`
entry in your local `kubeconfig` that corresponds to the host cluster.
You can verify that you have the required `kubeconfig` entry by
running:
```shell
kubectl config get-contexts
```
The output should contain an entry corresponding to your host cluster,
similar to the following:
```
CURRENT   NAME                                        CLUSTER                                     AUTHINFO                                    NAMESPACE
          gke_myproject_asia-east1-b_gce-asia-east1   gke_myproject_asia-east1-b_gce-asia-east1   gke_myproject_asia-east1-b_gce-asia-east1
```
You'll need to provide the `kubeconfig` context (called name in the
entry above) for your host cluster when you deploy your federation
control plane.
## Deploying a federation control plane.
To deploy a federation control plane on your host cluster, run the
`kubefed init` command. When you use `kubefed init`, you must provide
the following:
* Federation name
* `--host-cluster-context`, the `kubeconfig` context for the host cluster
* `--dns-zone-name`, a domain name suffix for your federated services
The following example command deploys a federation control plane with
the name `fellowship`, a host cluster context `rivendell`, and the
domain suffix `example.com`:
```shell
kubefed init fellowship --host-cluster-context=rivendell --dns-zone-name="example.com"
```
The domain suffix specified in `--dns-zone-name` must be an existing
domain that you control, and that is programmable by your DNS provider.
`kubefed init` sets up the federation control plane in the host
cluster and also adds an entry for the federation API server in your
local kubeconfig. Note that in the alpha release in Kubernetes 1.5,
`kubefed init` does not automatically set the current context to the
newly deployed federation. You can set the current context manually by
running:
```shell
kubectl config use-context fellowship
```
where `fellowship` is the name of your federation.
## Adding a cluster to a federation
Once you've deployed a federation control plane, you'll need to make
that control plane aware of the clusters it should manage. You can add
a cluster to your federation by using the `kubefed join` command.
To use `kubefed join`, you'll need to provide the name of the cluster
you want to add to the federation, and the `--host-cluster-context`
for the federation control plane's host cluster.
The following example command adds the cluster `gondor` to the
federation with host cluster `rivendell`:
```
kubefed join gondor --host-cluster-context=rivendell
```
> Note: Kubernetes requires that you manually join clusters to a
federation because the federation control plane manages only those
clusters that it is responsible for managing. Adding a cluster tells
the federation control plane that it is responsible for managing that
cluster.
### Naming rules and customization
The cluster name you supply to `kubefed join` must be a valid RFC 1035
label.
Furthermore, the federation control plane requires credentials for the
joined clusters to operate on them. These credentials are obtained
from the local kubeconfig. `kubefed join` uses the cluster name
specified as the argument to look for the cluster's context in the
local kubeconfig. If it fails to find a matching context, it exits
with an error.
This might cause issues in cases where context names for each cluster
in the federation don't follow
[RFC 1035](https://www.ietf.org/rfc/rfc1035.txt) label naming rules.
In such cases, you can specify a cluster name that conforms to the
[RFC 1035](https://www.ietf.org/rfc/rfc1035.txt) label naming rules
and specify the cluster context using the `--cluster-context` flag.
For example, if the context of the cluster you are joining is
`gondor_needs-no_king`, then you can join the cluster by running:
```shell
kubefed join gondor --host-cluster-context=rivendell --cluster-context=gondor_needs-no_king
```
#### Secret name
Cluster credentials required by the federation control plane as
described above are stored as a secret in the host cluster. The name
of the secret is also derived from the cluster name.
However, the name of a secret object in Kubernetes should conform
to the DNS subdomain name specification described in
[RFC 1123](https://tools.ietf.org/html/rfc1123). If this isn't the
case, you can pass the secret name to `kubefed join` using the
`--secret-name` flag. For example, if the cluster name is `noldor` and
the secret name is `11kingdom`, you can join the cluster by
running:
```shell
kubefed join noldor --host-cluster-context=rivendell --secret-name=11kingdom
```
Note: If your cluster name does not conform to the DNS subdomain name
specification, all you need to do is supply the secret name via the
`--secret-name` flag. `kubefed join` automatically creates the secret
for you.
## Removing a cluster from a federation
To remove a cluster from a federation, run the `kubefed unjoin`
command with the cluster name and the federation's
`--host-cluster-context`:
```
kubefed unjoin gondor --host-cluster-context=rivendell
```
## Turning down the federation control plane
Proper cleanup of the federation control plane is not fully implemented in
this alpha release of `kubefed`. However, for the time being, deleting
the federation system namespace should remove all the resources except
the persistent storage volume dynamically provisioned for the
federation control plane's etcd. You can delete the federation
namespace by running the following command:
```
$ kubectl delete ns federation-system
```
[Setting up Cluster Federation with kubefed](/docs/tutorials/federation/set-up-cluster-federation-kubefed/)

View File

@ -5,210 +5,6 @@ assignees:
title: Setting Pod CPU and Memory Limits
---
By default, pods run with unbounded CPU and memory limits. This means that any pod in the
system is able to consume as much CPU and memory as is available on the node that executes the pod.
{% include user-guide-content-moved.md %}
Users may want to impose restrictions on the amount of resources a single pod in the system may consume
for a variety of reasons.
For example:
1. Each node in the cluster has 2GB of memory. The cluster operator does not want to accept pods
that require more than 2GB of memory, since no node in the cluster can support the requirement. To prevent a
pod from remaining permanently unschedulable, the operator instead chooses to reject pods that exceed 2GB
of memory as part of admission control.
2. A cluster is shared by two communities in an organization that runs production and development workloads
respectively. Production workloads may consume up to 8GB of memory, but development workloads may consume up
to 512MB of memory. The cluster operator creates a separate namespace for each workload, and applies limits to
each namespace.
3. Users may create a pod which consumes resources just below the capacity of a machine. The leftover space
may be too small to be useful, but big enough for the waste to be costly over the entire cluster. As a result,
the cluster operator may want to require that a pod consume at least 20% of the memory and CPU of the
average node size in order to provide for more uniform scheduling and limit waste.
This example demonstrates how limits can be applied to a Kubernetes [namespace](/docs/admin/namespaces/walkthrough/) to control
min/max resource limits per pod. In addition, this example demonstrates how you can
apply default resource limits to pods in the absence of an end-user specified value.
See [LimitRange design doc](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/admission_control_limit_range.md) for more information. For a detailed description of the Kubernetes resource model, see [Resources](/docs/user-guide/compute-resources/)
## Step 0: Prerequisites
This example requires a running Kubernetes cluster. See the [Getting Started guides](/docs/getting-started-guides/) for how to get started.
Change to the `<kubernetes>` directory if you're not already there.
## Step 1: Create a namespace
This example will work in a custom namespace to demonstrate the concepts involved.
Let's create a new namespace called limit-example:
```shell
$ kubectl create namespace limit-example
namespace "limit-example" created
```
Note that `kubectl` commands will print the type and name of the resource created or mutated, which can then be used in subsequent commands:
```shell
$ kubectl get namespaces
NAME            STATUS    AGE
default         Active    51s
limit-example   Active    45s
```
## Step 2: Apply a limit to the namespace
Let's create a simple limit in our namespace.
```shell
$ kubectl create -f docs/admin/limitrange/limits.yaml --namespace=limit-example
limitrange "mylimits" created
```
Let's describe the limits that we have imposed in our namespace.
```shell
$ kubectl describe limits mylimits --namespace=limit-example
Name:       mylimits
Namespace:  limit-example
Type       Resource  Min   Max  Default Request  Default Limit  Max Limit/Request Ratio
----       --------  ---   ---  ---------------  -------------  -----------------------
Pod        cpu       200m  2    -                -              -
Pod        memory    6Mi   1Gi  -                -              -
Container  cpu       100m  2    200m             300m           -
Container  memory    3Mi   1Gi  100Mi            200Mi          -
```
In this scenario, we have said the following (a sketch of the corresponding `LimitRange` follows this list):
1. If a max constraint is specified for a resource (2 CPU and 1Gi memory in this case), then a limit
must be specified for that resource across all containers. Failure to specify a limit will result in
a validation error when attempting to create the pod. Note that a default limit value is set by
*default* in the file `limits.yaml` (300m CPU and 200Mi memory).
2. If a min constraint is specified for a resource (100m CPU and 3Mi memory in this case), then a
request must be specified for that resource across all containers. Failure to specify a request will
result in a validation error when attempting to create the pod. Note that a default request value is
set by *defaultRequest* in the file `limits.yaml` (200m CPU and 100Mi memory).
3. For any pod, the sum of all containers' memory requests must be >= 6Mi and the sum of all containers'
memory limits must be <= 1Gi; the sum of all containers' CPU requests must be >= 200m and the sum of all
containers' CPU limits must be <= 2.
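The following is a sketch, not the exact upstream `limits.yaml`, of a `LimitRange` that would produce the values shown by `kubectl describe` above:

```shell
# Write a LimitRange manifest equivalent to the limits described above; it
# could then be created with `kubectl create -f limits.yaml --namespace=limit-example`.
cat > limits.yaml <<EOF
apiVersion: v1
kind: LimitRange
metadata:
  name: mylimits
spec:
  limits:
  - type: Pod
    min:
      cpu: 200m
      memory: 6Mi
    max:
      cpu: "2"
      memory: 1Gi
  - type: Container
    min:
      cpu: 100m
      memory: 3Mi
    max:
      cpu: "2"
      memory: 1Gi
    defaultRequest:
      cpu: 200m
      memory: 100Mi
    default:
      cpu: 300m
      memory: 200Mi
EOF
```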
## Step 3: Enforcing limits at point of creation
The limits enumerated in a namespace are only enforced when a pod is created or updated in
the cluster. If you change the limits to a different value range, it does not affect pods that
were previously created in a namespace.
If a resource (CPU or memory) is being restricted by a limit, the user will get an error at time
of creation explaining why.
Let's first spin up a [Deployment](/docs/user-guide/deployments) that creates a single container Pod to demonstrate
how default values are applied to each pod.
```shell
$ kubectl run nginx --image=nginx --replicas=1 --namespace=limit-example
deployment "nginx" created
```
Note that `kubectl run` creates a Deployment named "nginx" on Kubernetes cluster >= v1.2. If you are running older versions, it creates replication controllers instead.
If you want to obtain the old behavior, use `--generator=run/v1` to create replication controllers. See [`kubectl run`](/docs/user-guide/kubectl/kubectl_run/) for more details.
The Deployment manages 1 replica of a single-container Pod. Let's take a look at the Pod it manages. First, find the name of the Pod:
```shell
$ kubectl get pods --namespace=limit-example
NAME                     READY     STATUS    RESTARTS   AGE
nginx-2040093540-s8vzu   1/1       Running   0          11s
```
Let's print this Pod with yaml output format (using `-o yaml` flag), and then `grep` the `resources` field. Note that your pod name will be different.
```shell
$ kubectl get pods nginx-2040093540-s8vzu --namespace=limit-example -o yaml | grep resources -C 8
  resourceVersion: "57"
  selfLink: /api/v1/namespaces/limit-example/pods/nginx-2040093540-ivimu
  uid: 67b20741-f53b-11e5-b066-64510658e388
spec:
  containers:
  - image: nginx
    imagePullPolicy: Always
    name: nginx
    resources:
      limits:
        cpu: 300m
        memory: 200Mi
      requests:
        cpu: 200m
        memory: 100Mi
    terminationMessagePath: /dev/termination-log
    volumeMounts:
```
Note that our nginx container has picked up the namespace default CPU and memory resource *limits* and *requests*.
Let's create a pod that exceeds our allowed limits by having it have a container that requests 3 CPU cores.
```shell
$ kubectl create -f docs/admin/limitrange/invalid-pod.yaml --namespace=limit-example
Error from server: error when creating "docs/admin/limitrange/invalid-pod.yaml": Pod "invalid-pod" is forbidden: [Maximum cpu usage per Pod is 2, but limit is 3., Maximum cpu usage per Container is 2, but limit is 3.]
```
Let's create a pod that falls within the allowed limit boundaries.
```shell
$ kubectl create -f docs/admin/limitrange/valid-pod.yaml --namespace=limit-example
pod "valid-pod" created
```
Now look at the Pod's resources field:
```shell
$ kubectl get pods valid-pod --namespace=limit-example -o yaml | grep -C 6 resources
  uid: 3b1bfd7a-f53c-11e5-b066-64510658e388
spec:
  containers:
  - image: gcr.io/google_containers/serve_hostname
    imagePullPolicy: Always
    name: kubernetes-serve-hostname
    resources:
      limits:
        cpu: "1"
        memory: 512Mi
      requests:
        cpu: "1"
        memory: 512Mi
```
Note that this pod specifies explicit resource *limits* and *requests* so it did not pick up the namespace
default values.
Note: In the default Kubernetes setup, CPU resource *limits* are enforced on the physical node
that runs the container, unless the administrator deploys the kubelet with the following flag set to `false`:
```shell
$ kubelet --help
Usage of kubelet
....
--cpu-cfs-quota[=true]: Enable CPU CFS quota enforcement for containers that specify CPU limits
$ kubelet --cpu-cfs-quota=false ...
```
## Step 4: Cleanup
To remove the resources used by this example, you can just delete the limit-example namespace.
```shell
$ kubectl delete namespace limit-example
namespace "limit-example" deleted
$ kubectl get namespaces
NAME      STATUS    AGE
default   Active    12m
```
## Summary
Cluster operators that want to restrict the amount of resources a single container or pod may consume
are able to define allowable ranges per Kubernetes namespace. In the absence of any explicit assignments,
the Kubernetes system is able to apply default resource *limits* and *requests* if desired in order to
constrain the amount of resource a pod consumes on a node.
[Setting Pod CPU and Memory Limits](/docs/tasks/configure-pod-container/limit-range/)

View File

@ -4,63 +4,6 @@ assignees:
title: Using Multiple Clusters
---
You may want to set up multiple Kubernetes clusters, both to
have clusters in different regions to be nearer to your users, and to tolerate failures and/or invasive maintenance.
This document describes some of the issues to consider when making a decision about doing so.
{% include user-guide-content-moved.md %}
If you decide to have multiple clusters, Kubernetes provides a way to [federate them](/docs/admin/federation/).
## Scope of a single cluster
On IaaS providers such as Google Compute Engine or Amazon Web Services, a VM exists in a
[zone](https://cloud.google.com/compute/docs/zones) or [availability
zone](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html).
We suggest that all the VMs in a Kubernetes cluster should be in the same availability zone, because:
- compared to having a single global Kubernetes cluster, there are fewer single-points of failure
- compared to a cluster that spans availability zones, it is easier to reason about the availability properties of a
single-zone cluster.
- when the Kubernetes developers are designing the system (e.g. making assumptions about latency, bandwidth, or
correlated failures) they are assuming all the machines are in a single data center, or otherwise closely connected.
It is okay to have multiple clusters per availability zone, though on balance we think fewer is better.
Reasons to prefer fewer clusters are:
- improved bin packing of Pods in some cases with more nodes in one cluster (less resource fragmentation)
- reduced operational overhead (though the advantage is diminished as ops tooling and processes mature)
- reduced per-cluster fixed resource costs, e.g. apiserver VMs (though these are small as a percentage
of overall cluster cost for medium to large clusters).
Reasons to have multiple clusters include:
- strict security policies requiring isolation of one class of work from another (but, see Partitioning Clusters
below).
- test clusters to canary new Kubernetes releases or other cluster software.
## Selecting the right number of clusters
The selection of the number of Kubernetes clusters may be a relatively static choice, only revisited occasionally.
By contrast, the number of nodes in a cluster and the number of pods in a service may change frequently according to
load and growth.
To pick the number of clusters, first, decide which regions you need to be in to have adequate latency to all your end users, for services that will run
on Kubernetes (if you use a Content Distribution Network, the latency requirements for the CDN-hosted content need not
be considered). Legal issues might influence this as well. For example, a company with a global customer base might decide to have clusters in US, EU, AP, and SA regions.
Call the number of regions to be in `R`.
Second, decide how many clusters should be able to be unavailable at the same time while the service as a whole remains available. Call
the number that can be unavailable `U`. If you are not sure, then 1 is a fine choice.
If it is allowable for load-balancing to direct traffic to any region in the event of a cluster failure, then
you need at least the larger of `R` or `U + 1` clusters. If it is not (e.g. you want to ensure low latency for all
users in the event of a cluster failure), then you need `R * (U + 1)` clusters
(`U + 1` in each of `R` regions). For example, with `R = 3` and `U = 1`, that is 3 clusters if cross-region failover is acceptable, or `3 * 2 = 6` clusters if it is not. In any case, try to put each cluster in a different zone.
Finally, if any of your clusters would need more than the maximum recommended number of nodes for a Kubernetes cluster, then
you may need even more clusters. Kubernetes v1.3 supports clusters up to 1000 nodes in size.
## Working with multiple clusters
When you have multiple clusters, you would typically create services with the same config in each cluster and put each of those
service instances behind a load balancer (AWS Elastic Load Balancer, GCE Forwarding Rule or HTTP Load Balancer) spanning all of them, so that
failures of a single cluster are not visible to end users.
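As a minimal sketch of that pattern, where the kubeconfig context names and the `my-service.yaml` manifest are illustrative only:

```shell
# Create the same Service definition in each cluster via its kubeconfig context,
# then register the resulting endpoints behind an external load balancer.
kubectl --context=us-central create -f my-service.yaml
kubectl --context=europe-west create -f my-service.yaml
```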
[Using Multiple Clusters](/docs/concepts/cluster-administration/multiple-clusters/)

View File

@ -6,68 +6,6 @@ assignees:
title: Network Plugins
---
* TOC
{:toc}
{% include user-guide-content-moved.md %}
__Disclaimer__: Network plugins are in alpha. Their details will change rapidly.
Network plugins in Kubernetes come in a few flavors:
* CNI plugins: adhere to the appc/CNI specification, designed for interoperability.
* Kubenet plugin: implements basic `cbr0` using the `bridge` and `host-local` CNI plugins
## Installation
The kubelet has a single default network plugin, and a default network common to the entire cluster. It probes for plugins when it starts up, remembers what it found, and executes the selected plugin at appropriate times in the pod lifecycle (this is only true for docker, as rkt manages its own CNI plugins). There are two Kubelet command line parameters to keep in mind when using plugins:
* `network-plugin-dir`: Kubelet probes this directory for plugins on startup
* `network-plugin`: The network plugin to use from `network-plugin-dir`. It must match the name reported by a plugin probed from the plugin directory. For CNI plugins, this is simply "cni".
## Network Plugin Requirements
Besides providing the [`NetworkPlugin` interface](https://github.com/kubernetes/kubernetes/tree/{{page.version}}/pkg/kubelet/network/plugins.go) to configure and clean up pod networking, the plugin may also need specific support for kube-proxy. The iptables proxy obviously depends on iptables, and the plugin may need to ensure that container traffic is made available to iptables. For example, if the plugin connects containers to a Linux bridge, the plugin must set the `net/bridge/bridge-nf-call-iptables` sysctl to `1` to ensure that the iptables proxy functions correctly. If the plugin does not use a Linux bridge (but instead something like Open vSwitch or some other mechanism) it should ensure container traffic is appropriately routed for the proxy.
By default if no kubelet network plugin is specified, the `noop` plugin is used, which sets `net/bridge/bridge-nf-call-iptables=1` to ensure simple configurations (like docker with a bridge) work correctly with the iptables proxy.
### CNI
The CNI plugin is selected by passing Kubelet the `--network-plugin=cni` command-line option. Kubelet reads a file from `--cni-conf-dir` (default `/etc/cni/net.d`) and uses the CNI configuration from that file to set up each pod's network. The CNI configuration file must match the [CNI specification](https://github.com/containernetworking/cni/blob/master/SPEC.md#network-configuration), and any required CNI plugins referenced by the configuration must be present in `--cni-bin-dir` (default `/opt/cni/bin`).
If there are multiple CNI configuration files in the directory, the first one in lexicographic order of file name is used.
In addition to the CNI plugin specified by the configuration file, Kubernetes requires the standard CNI [`lo`](https://github.com/containernetworking/cni/blob/master/plugins/main/loopback/loopback.go) plugin, at minimum version 0.2.0
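As a rough sketch, a configuration file using the standard `bridge` and `host-local` plugins might look like the following; the file name, network name, and subnet are illustrative, not prescribed:

```shell
# Write a minimal CNI configuration into the default --cni-conf-dir.
cat <<EOF | sudo tee /etc/cni/net.d/10-mynet.conf
{
  "name": "mynet",
  "type": "bridge",
  "bridge": "cni0",
  "isGateway": true,
  "ipMasq": true,
  "ipam": {
    "type": "host-local",
    "subnet": "10.22.0.0/16",
    "routes": [
      { "dst": "0.0.0.0/0" }
    ]
  }
}
EOF
```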
Limitation: Due to [#31307](https://github.com/kubernetes/kubernetes/issues/31307), `HostPort` won't work with the CNI networking plugin at the moment. That means all `hostPort` attributes in pods are simply ignored.
### kubenet
Kubenet is a very basic, simple network plugin, on Linux only. It does not, of itself, implement more advanced features like cross-node networking or network policy. It is typically used together with a cloud provider that sets up routing rules for communication between nodes, or in single-node environments.
Kubenet creates a Linux bridge named `cbr0` and creates a veth pair for each pod with the host end of each pair connected to `cbr0`. The pod end of the pair is assigned an IP address allocated from a range assigned to the node either through configuration or by the controller-manager. `cbr0` is assigned an MTU matching the smallest MTU of an enabled normal interface on the host.
The plugin requires a few things:
* The standard CNI `bridge`, `lo` and `host-local` plugins are required, at minimum version 0.2.0. Kubenet will first search for them in `/opt/cni/bin`. Specify `network-plugin-dir` to supply additional search paths. The first match found will take effect.
* Kubelet must be run with the `--network-plugin=kubenet` argument to enable the plugin
* Kubelet should also be run with the `--non-masquerade-cidr=<clusterCidr>` argument to ensure traffic to IPs outside this range will use IP masquerade.
* The node must be assigned an IP subnet through either the `--pod-cidr` kubelet command-line option or the `--allocate-node-cidrs=true --cluster-cidr=<cidr>` controller-manager command-line options.
### Customizing the MTU (with kubenet)
The MTU should always be configured correctly to get the best networking performance. Network plugins will usually try
to infer a sensible MTU, but sometimes the logic will not result in an optimal MTU. For example, if the
Docker bridge or another interface has a small MTU, kubenet will currently select that MTU. Or if you are
using IPSEC encapsulation, the MTU must be reduced, and this calculation is out-of-scope for
most network plugins.
Where needed, you can specify the MTU explicitly with the `network-plugin-mtu` kubelet option. For example,
on AWS the `eth0` MTU is typically 9001, so you might specify `--network-plugin-mtu=9001`. If you're using IPSEC you
might reduce it to allow for encapsulation overhead e.g. `--network-plugin-mtu=8873`.
This option is provided to the network-plugin; currently **only kubenet supports `network-plugin-mtu`**.
## Usage Summary
* `--network-plugin=cni` specifies that we use the `cni` network plugin with actual CNI plugin binaries located in `--cni-bin-dir` (default `/opt/cni/bin`) and CNI plugin configuration located in `--cni-conf-dir` (default `/etc/cni/net.d`).
* `--network-plugin=kubenet` specifies that we use the `kubenet` network plugin with CNI `bridge` and `host-local` plugins placed in `/opt/cni/bin` or `network-plugin-dir`.
* `--network-plugin-mtu=9001` specifies the MTU to use, currently only used by the `kubenet` network plugin.
[Network Plugins](/docs/concepts/cluster-administration/network-plugins/)

View File

@ -4,212 +4,6 @@ assignees:
title: Networking in Kubernetes
---
Kubernetes approaches networking somewhat differently than Docker does by
default. There are 4 distinct networking problems to solve:
{% include user-guide-content-moved.md %}
1. Highly-coupled container-to-container communications: this is solved by
[pods](/docs/user-guide/pods/) and `localhost` communications.
2. Pod-to-Pod communications: this is the primary focus of this document.
3. Pod-to-Service communications: this is covered by [services](/docs/user-guide/services/).
4. External-to-Service communications: this is covered by [services](/docs/user-guide/services/).
* TOC
{:toc}
## Summary
Kubernetes assumes that pods can communicate with other pods, regardless of
which host they land on. We give every pod its own IP address so you do not
need to explicitly create links between pods and you almost never need to deal
with mapping container ports to host ports. This creates a clean,
backwards-compatible model where pods can be treated much like VMs or physical
hosts from the perspectives of port allocation, naming, service discovery, load
balancing, application configuration, and migration.
To achieve this we must impose some requirements on how you set up your cluster
networking.
## Docker model
Before discussing the Kubernetes approach to networking, it is worthwhile to
review the "normal" way that networking works with Docker. By default, Docker
uses host-private networking. It creates a virtual bridge, called `docker0` by
default, and allocates a subnet from one of the private address blocks defined
in [RFC1918](https://tools.ietf.org/html/rfc1918) for that bridge. For each
container that Docker creates, it allocates a virtual ethernet device (called
`veth`) which is attached to the bridge. The veth is mapped to appear as `eth0`
in the container, using Linux namespaces. The in-container `eth0` interface is
given an IP address from the bridge's address range.
The result is that Docker containers can talk to other containers only if they
are on the same machine (and thus the same virtual bridge). Containers on
different machines can not reach each other - in fact they may end up with the
exact same network ranges and IP addresses.
In order for Docker containers to communicate across nodes, they must be
allocated ports on the machine's own IP address, which are then forwarded or
proxied to the containers. This obviously means that containers must either
coordinate which ports they use very carefully or else be allocated ports
dynamically.
## Kubernetes model
Coordinating ports across multiple developers is very difficult to do at
scale and exposes users to cluster-level issues outside of their control.
Dynamic port allocation brings a lot of complications to the system - every
application has to take ports as flags, the API servers have to know how to
insert dynamic port numbers into configuration blocks, services have to know
how to find each other, etc. Rather than deal with this, Kubernetes takes a
different approach.
Kubernetes imposes the following fundamental requirements on any networking
implementation (barring any intentional network segmentation policies):
* all containers can communicate with all other containers without NAT
* all nodes can communicate with all containers (and vice-versa) without NAT
* the IP that a container sees itself as is the same IP that others see it as
What this means in practice is that you can not just take two computers
running Docker and expect Kubernetes to work. You must ensure that the
fundamental requirements are met.
This model is not only less complex overall, but it is principally compatible
with the desire for Kubernetes to enable low-friction porting of apps from VMs
to containers. If your job previously ran in a VM, your VM had an IP and could
talk to other VMs in your project. This is the same basic model.
Until now this document has talked about containers. In reality, Kubernetes
applies IP addresses at the `Pod` scope - containers within a `Pod` share their
network namespaces - including their IP address. This means that containers
within a `Pod` can all reach each other's ports on `localhost`. This does imply
that containers within a `Pod` must coordinate port usage, but this is no
different than processes in a VM. We call this the "IP-per-pod" model. This
is implemented in Docker as a "pod container" which holds the network namespace
open while "app containers" (the things the user specified) join that namespace
with Docker's `--net=container:<id>` function.
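A rough sketch of that pattern with plain Docker, assuming the `pause` image and an arbitrary application image:

```shell
# The "pod container" only holds the network namespace open.
docker run -d --name pod-infra gcr.io/google_containers/pause-amd64:3.0

# "App containers" join that namespace and therefore share its IP and ports.
docker run -d --name app --net=container:pod-infra nginx
```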
As with Docker, it is possible to request host ports, but this is reduced to a
very niche operation. In this case a port will be allocated on the host `Node`
and traffic will be forwarded to the `Pod`. The `Pod` itself is blind to the
existence or non-existence of host ports.
## How to achieve this
There are a number of ways that this network model can be implemented. This
document is not an exhaustive study of the various methods, but hopefully serves
as an introduction to various technologies and serves as a jumping-off point.
The following networking options are sorted alphabetically - the order does not
imply any preferential status.
### Contiv
[Contiv](https://github.com/contiv/netplugin) provides configurable networking (native l3 using BGP, overlay using vxlan, classic l2, or Cisco-SDN/ACI) for various use cases. [Contiv](http://contiv.io) is all open sourced.
### Flannel
[Flannel](https://github.com/coreos/flannel#flannel) is a very simple overlay
network that satisfies the Kubernetes requirements. Many
people have reported success with Flannel and Kubernetes.
### Google Compute Engine (GCE)
For the Google Compute Engine cluster configuration scripts, we use [advanced
routing](https://cloud.google.com/compute/docs/networking#routing) to
assign each VM a subnet (default is `/24` - 254 IPs). Any traffic bound for that
subnet will be routed directly to the VM by the GCE network fabric. This is in
addition to the "main" IP address assigned to the VM, which is NAT'ed for
outbound internet access. A linux bridge (called `cbr0`) is configured to exist
on that subnet, and is passed to docker's `--bridge` flag.
We start Docker with:
```shell
DOCKER_OPTS="--bridge=cbr0 --iptables=false --ip-masq=false"
```
This bridge is created by Kubelet (controlled by the `--network-plugin=kubenet`
flag) according to the `Node`'s `spec.podCIDR`.
Docker will now allocate IPs from the `cbr-cidr` block. Containers can reach
each other and `Nodes` over the `cbr0` bridge. Those IPs are all routable
within the GCE project network.
GCE itself does not know anything about these IPs, though, so it will not NAT
them for outbound internet traffic. To achieve that we use an iptables rule to
masquerade (aka SNAT - to make it seem as if packets came from the `Node`
itself) traffic that is bound for IPs outside the GCE project network
(10.0.0.0/8).
```shell
iptables -t nat -A POSTROUTING ! -d 10.0.0.0/8 -o eth0 -j MASQUERADE
```
Lastly we enable IP forwarding in the kernel (so the kernel will process
packets for bridged containers):
```shell
sysctl net.ipv4.ip_forward=1
```
The result of all this is that all `Pods` can reach each other and can egress
traffic to the internet.
### L2 networks and linux bridging
If you have a "dumb" L2 network, such as a simple switch in a "bare-metal"
environment, you should be able to do something similar to the above GCE setup.
Note that these instructions have only been tried very casually - it seems to
work, but has not been thoroughly tested. If you use this technique and
perfect the process, please let us know.
Follow the "With Linux Bridge devices" section of [this very nice
tutorial](http://blog.oddbit.com/2014/08/11/four-ways-to-connect-a-docker/) from
Lars Kellogg-Stedman.
### Nuage Networks VCS (Virtualized Cloud Services)
[Nuage](http://www.nuagenetworks.net) provides a highly scalable policy-based Software-Defined Networking (SDN) platform. Nuage uses the open source Open vSwitch for the data plane along with a feature rich SDN Controller built on open standards.
The Nuage platform uses overlays to provide seamless policy-based networking between Kubernetes Pods and non-Kubernetes environments (VMs and bare metal servers). Nuage's policy abstraction model is designed with applications in mind and makes it easy to declare fine-grained policies for applications. The platform's real-time analytics engine enables visibility and security monitoring for Kubernetes applications.
### OpenVSwitch
[OpenVSwitch](/docs/admin/ovs-networking) is a somewhat more mature but also
complicated way to build an overlay network. This is endorsed by several of the
"Big Shops" for networking.
### OVN (Open Virtual Networking)
OVN is an open source network virtualization solution developed by the
Open vSwitch community. It lets one create logical switches, logical routers,
stateful ACLs, load balancers, etc. to build different virtual networking
topologies. The project has a specific Kubernetes plugin and documentation
at [ovn-kubernetes](https://github.com/openvswitch/ovn-kubernetes).
### Project Calico
[Project Calico](http://docs.projectcalico.org/) is an open source container networking provider and network policy engine.
Calico provides a highly scalable networking and network policy solution for connecting Kubernetes pods based on the same IP networking principles as the internet. Calico can be deployed without encapsulation or overlays to provide high-performance, high-scale data center networking. Calico also provides fine-grained, intent based network security policy for Kubernetes pods via its distributed firewall.
Calico can also be run in policy enforcement mode in conjunction with other networking solutions such as Flannel, aka [canal](https://github.com/tigera/canal), or native GCE networking.
### Romana
[Romana](http://romana.io) is an open source network and security automation solution that lets you deploy Kubernetes without an overlay network. Romana supports Kubernetes [Network Policy](/docs/user-guide/networkpolicies/) to provide isolation across network namespaces.
### Weave Net from Weaveworks
[Weave Net](https://www.weave.works/products/weave-net/) is a
resilient and simple to use network for Kubernetes and its hosted applications.
Weave Net runs as a [CNI plug-in](https://www.weave.works/docs/net/latest/cni-plugin/)
or stand-alone. In either version, it doesn't require any configuration or extra code
to run, and in both cases, the network provides one IP address per pod - as is standard for Kubernetes.
## Other reading
The early design of the networking model and its rationale, and some future
plans are described in more detail in the [networking design
document](https://github.com/kubernetes/kubernetes/blob/{{page.githubbranch}}/docs/design/networking.md).
[Cluster Networking](/docs/concepts/cluster-administration/networking/)

View File

@ -5,244 +5,6 @@ assignees:
title: Monitoring Node Health
---
* TOC
{:toc}
{% include user-guide-content-moved.md %}
## Node Problem Detector
*Node problem detector* is a [DaemonSet](/docs/admin/daemons/) that monitors
node health. It collects node problems from various daemons and reports them
to the apiserver as [NodeCondition](/docs/admin/node/#node-condition) and
[Event](/docs/api-reference/v1/definitions/#_v1_event) objects.
It currently supports detection of some known kernel issues, and will detect more
node problems over time.
Currently Kubernetes won't take any action on the node conditions and events
generated by node problem detector. In the future, a remedy system could be
introduced to deal with node problems.
See more information
[here](https://github.com/kubernetes/node-problem-detector).
## Limitations
* The kernel issue detection of node problem detector only supports file-based
kernel logs for now. It doesn't support log tools like journald.
* The kernel issue detection of node problem detector makes assumptions about the kernel
log format, and currently only works on Ubuntu and Debian. However, it is easy to extend
it to [support other log formats](/docs/admin/node-problem/#support-other-log-format).
## Enable/Disable in GCE cluster
Node problem detector is [running as a cluster addon](cluster-large.md/#addon-resources) enabled by default in the
GCE cluster.
You can enable or disable it by setting the environment variable
`KUBE_ENABLE_NODE_PROBLEM_DETECTOR` before running `kube-up.sh`.
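For example, a sketch of bringing up a GCE cluster with the addon explicitly enabled; the script path and the value `true` assume a checkout of the Kubernetes source tree:

```shell
KUBE_ENABLE_NODE_PROBLEM_DETECTOR=true ./cluster/kube-up.sh
```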
## Use in Other Environment
To enable node problem detector in environments other than GCE, you can use
either `kubectl` or an addon pod.
### Kubectl
This is the recommended way to start node problem detector outside of GCE. It
provides more flexible management, such as overwriting the default
configuration to fit your environment or to detect
customized node problems.
* **Step 1:** Create `node-problem-detector.yaml`:
```yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: node-problem-detector-v0.1
  namespace: kube-system
  labels:
    k8s-app: node-problem-detector
    version: v0.1
    kubernetes.io/cluster-service: "true"
spec:
  template:
    metadata:
      labels:
        k8s-app: node-problem-detector
        version: v0.1
        kubernetes.io/cluster-service: "true"
    spec:
      hostNetwork: true
      containers:
      - name: node-problem-detector
        image: gcr.io/google_containers/node-problem-detector:v0.1
        securityContext:
          privileged: true
        resources:
          limits:
            cpu: "200m"
            memory: "100Mi"
          requests:
            cpu: "20m"
            memory: "20Mi"
        volumeMounts:
        - name: log
          mountPath: /log
          readOnly: true
      volumes:
      - name: log
        hostPath:
          path: /var/log/
```
***Notice that you should make sure the system log directory is right for your
OS distro.***
* **Step 2:** Start node problem detector with `kubectl`:
```shell
kubectl create -f node-problem-detector.yaml
```
### Addon Pod
This is for those who have their own cluster bootstrap solution, and don't need
to overwrite the default configuration. They could leverage the addon pod to
further automate the deployment.
Just create `node-problem-detector.yaml`, and put it under the addon pods directory
`/etc/kubernetes/addons/node-problem-detector` on the master node.
## Overwrite the Configuration
The [default configuration](https://github.com/kubernetes/node-problem-detector/tree/v0.1/config)
is embedded when building the docker image of node problem detector.
However, you can use a [ConfigMap](/docs/user-guide/configmap/) to overwrite it
by following these steps:
* **Step 1:** Change the config files in `config/`.
* **Step 2:** Create the ConfigMap `node-problem-detector-config` with `kubectl create configmap
node-problem-detector-config --from-file=config/`.
* **Step 3:** Change the `node-problem-detector.yaml` to use the ConfigMap:
```yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: node-problem-detector-v0.1
  namespace: kube-system
  labels:
    k8s-app: node-problem-detector
    version: v0.1
    kubernetes.io/cluster-service: "true"
spec:
  template:
    metadata:
      labels:
        k8s-app: node-problem-detector
        version: v0.1
        kubernetes.io/cluster-service: "true"
    spec:
      hostNetwork: true
      containers:
      - name: node-problem-detector
        image: gcr.io/google_containers/node-problem-detector:v0.1
        securityContext:
          privileged: true
        resources:
          limits:
            cpu: "200m"
            memory: "100Mi"
          requests:
            cpu: "20m"
            memory: "20Mi"
        volumeMounts:
        - name: log
          mountPath: /log
          readOnly: true
        - name: config # Overwrite the config/ directory with ConfigMap volume
          mountPath: /config
          readOnly: true
      volumes:
      - name: log
        hostPath:
          path: /var/log/
      - name: config # Define ConfigMap volume
        configMap:
          name: node-problem-detector-config
```
* **Step 4:** Re-create the node problem detector with the new yaml file:
```shell
kubectl delete -f node-problem-detector.yaml # If you have a node-problem-detector running
kubectl create -f node-problem-detector.yaml
```
***Notice that this approach only applies to node problem detector started with `kubectl`.***
For node problem detector running as a cluster addon, configuration overwriting is not
currently supported, because the addon manager does not support ConfigMap.
## Kernel Monitor
*Kernel Monitor* is a problem daemon in node problem detector. It monitors the kernel log
and detects known kernel issues following predefined rules.
The Kernel Monitor matches kernel issues according to a predefined rule list in
[`config/kernel-monitor.json`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/config/kernel-monitor.json).
The rule list is extensible, and you can always extend it by [overwriting the
configuration](/docs/admin/node-problem/#overwrite-the-configuration).
### Add New NodeConditions
To support new node conditions, you can extend the `conditions` field in
`config/kernel-monitor.json` with a new condition definition:
```json
{
"type": "NodeConditionType",
"reason": "CamelCaseDefaultNodeConditionReason",
"message": "arbitrary default node condition message"
}
```
### Detect New Problems
To detect new problems, you can extend the `rules` field in `config/kernel-monitor.json`
with a new rule definition:
```json
{
"type": "temporary/permanent",
"condition": "NodeConditionOfPermanentIssue",
"reason": "CamelCaseShortReason",
"message": "regexp matching the issue in the kernel log"
}
```
### Change Log Path
The kernel log may be located at a different path in different OS distros. The `log`
field in `config/kernel-monitor.json` is the log path inside the container.
You can always configure it to match your OS distro.
### Support Other Log Format
Kernel monitor uses the [`Translator`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/pkg/kernelmonitor/translator/translator.go)
plugin to translate the kernel log into its internal data structure. It is easy to
implement a new translator for a new log format.
## Caveats
It is recommended to run the node problem detector in your cluster to monitor
the node health. However, you should be aware that this will introduce extra
resource overhead on each node. Usually this is fine, because:
* The kernel log is generated relatively slowly.
* Resource limit is set for node problem detector.
* Even under high load, the resource usage is acceptable.
(see [benchmark result](https://github.com/kubernetes/node-problem-detector/issues/2#issuecomment-220255629))
[Monitoring Node Health](/docs/tasks/debug-application-cluster/monitor-node-health/)

View File

@ -6,363 +6,6 @@ assignees:
title: Configuring Out Of Resource Handling
---
* TOC
{:toc}
{% include user-guide-content-moved.md %}
The `kubelet` needs to preserve node stability when available compute resources are low.
This is especially important when dealing with incompressible resources such as memory or disk.
If either resource is exhausted, the node would become unstable.
## Eviction Policy
The `kubelet` can proactively monitor for and prevent total starvation of a compute resource. In those cases, the `kubelet` can proactively fail one or more pods in order to reclaim
the starved resource. When the `kubelet` fails a pod, it terminates all containers in the pod, and the `PodPhase`
is transitioned to `Failed`.
### Eviction Signals
The `kubelet` can support the ability to trigger eviction decisions on the signals described in the
table below. The value of each signal is described in the description column based on the `kubelet`
summary API.
| Eviction Signal | Description |
|----------------------------|-----------------------------------------------------------------------|
| `memory.available` | `memory.available` := `node.status.capacity[memory]` - `node.stats.memory.workingSet` |
| `nodefs.available` | `nodefs.available` := `node.stats.fs.available` |
| `nodefs.inodesFree` | `nodefs.inodesFree` := `node.stats.fs.inodesFree` |
| `imagefs.available` | `imagefs.available` := `node.stats.runtime.imagefs.available` |
| `imagefs.inodesFree` | `imagefs.inodesFree` := `node.stats.runtime.imagefs.inodesFree` |
Each of the above signals supports either a literal or percentage based value. The percentage based value
is calculated relative to the total capacity associated with each signal.
`kubelet` supports only two filesystem partitions.
1. The `nodefs` filesystem that kubelet uses for volumes, daemon logs, etc.
1. The `imagefs` filesystem that container runtimes use for storing images and container writable layers.
`imagefs` is optional. `kubelet` auto-discovers these filesystems using cAdvisor. `kubelet` does not care about any
other filesystems. Any other types of configurations are not currently supported by the kubelet. For example, it is
*not OK* to store volumes and logs in a dedicated `filesystem`.
In future releases, the `kubelet` will deprecate the existing [garbage collection](/docs/admin/garbage-collection/)
support in favor of eviction in response to disk pressure.
### Eviction Thresholds
The `kubelet` supports the ability to specify eviction thresholds that trigger the `kubelet` to reclaim resources.
Each threshold is of the following form:
`<eviction-signal><operator><quantity>`
* valid `eviction-signal` tokens are as defined above.
* valid `operator` tokens are `<`
* valid `quantity` tokens must match the quantity representation used by Kubernetes
* an eviction threshold can be expressed as a percentage if it ends with a `%` token.
For example, if a node has `10Gi` of memory, and the desire is to induce eviction
if available memory falls below `1Gi`, an eviction threshold can be specified as either
of the following (but not both).
* `memory.available<10%`
* `memory.available<1Gi`
#### Soft Eviction Thresholds
A soft eviction threshold pairs an eviction threshold with a required
administrator-specified grace period. No action is taken by the `kubelet`
to reclaim resources associated with the eviction signal until that grace
period has been exceeded. If no grace period is provided, the `kubelet` will
error on startup.
In addition, if a soft eviction threshold has been met, an operator can
specify a maximum allowed pod termination grace period to use when evicting
pods from the node. If specified, the `kubelet` will use the lesser value among
the `pod.Spec.TerminationGracePeriodSeconds` and the max allowed grace period.
If not specified, the `kubelet` will kill pods immediately with no graceful
termination.
To configure soft eviction thresholds, the following flags are supported (a combined example follows the list):
* `eviction-soft` describes a set of eviction thresholds (e.g. `memory.available<1.5Gi`) that if met over a
corresponding grace period would trigger a pod eviction.
* `eviction-soft-grace-period` describes a set of eviction grace periods (e.g. `memory.available=1m30s`) that
correspond to how long a soft eviction threshold must hold before triggering a pod eviction.
* `eviction-max-pod-grace-period` describes the maximum allowed grace period (in seconds) to use when terminating
pods in response to a soft eviction threshold being met.
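Taken together, a sketch of a kubelet launched with soft eviction configured; the values here are illustrative, not recommendations:

```shell
# Evict pods if memory.available stays below 1.5Gi for 1m30s, allowing each
# evicted pod at most 60 seconds of graceful termination.
kubelet --eviction-soft=memory.available<1.5Gi \
        --eviction-soft-grace-period=memory.available=1m30s \
        --eviction-max-pod-grace-period=60 ...
```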
#### Hard Eviction Thresholds
A hard eviction threshold has no grace period, and if observed, the `kubelet`
will take immediate action to reclaim the associated starved resource. If a
hard eviction threshold is met, the `kubelet` will kill the pod immediately
with no graceful termination.
To configure hard eviction thresholds, the following flag is supported:
* `eviction-hard` describes a set of eviction thresholds (e.g. `memory.available<1Gi`) that if met
would trigger a pod eviction.
The `kubelet` has the following default hard eviction thresholds:
* `--eviction-hard=memory.available<100Mi`
### Eviction Monitoring Interval
The `kubelet` evaluates eviction thresholds per its configured housekeeping interval.
* `housekeeping-interval` is the interval between container housekeepings.
### Node Conditions
The `kubelet` will map one or more eviction signals to a corresponding node condition.
If a hard eviction threshold has been met, or a soft eviction threshold has been met
independent of its associated grace period, the `kubelet` will report a condition that
reflects the node is under pressure.
The following node conditions are defined that correspond to the specified eviction signal.
| Node Condition | Eviction Signal | Description |
|-------------------------|-------------------------------|--------------------------------------------|
| `MemoryPressure` | `memory.available` | Available memory on the node has satisfied an eviction threshold |
| `DiskPressure` | `nodefs.available`, `nodefs.inodesFree`, `imagefs.available`, or `imagefs.inodesFree` | Available disk space and inodes on either the node's root filesystem or image filesystem has satisfied an eviction threshold |
The `kubelet` will continue to report node status updates at the frequency specified by
`--node-status-update-frequency` which defaults to `10s`.
### Oscillation of node conditions
If a node is oscillating above and below a soft eviction threshold, but not exceeding
its associated grace period, it would cause the corresponding node condition to
constantly oscillate between true and false, and could cause poor scheduling decisions
as a consequence.
To protect against this oscillation, the following flag is defined to control how
long the `kubelet` must wait before transitioning out of a pressure condition.
* `eviction-pressure-transition-period` is the duration for which the `kubelet` has
to wait before transitioning out of an eviction pressure condition.
The `kubelet` would ensure that it has not observed an eviction threshold being met
for the specified pressure condition for the period specified before toggling the
condition back to `false`.
### Reclaiming node level resources
If an eviction threshold has been met and the grace period has passed,
the `kubelet` will initiate the process of reclaiming the pressured resource
until it has observed the signal has gone below its defined threshold.
The `kubelet` attempts to reclaim node level resources prior to evicting end-user pods. If
disk pressure is observed, the `kubelet` reclaims node level resources differently if the
machine has a dedicated `imagefs` configured for the container runtime.
#### With Imagefs
If `nodefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order:
1. Delete dead pods/containers
If `imagefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order:
1. Delete all unused images
#### Without Imagefs
If `nodefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order:
1. Delete dead pods/containers
1. Delete all unused images
### Evicting end-user pods
If the `kubelet` is unable to reclaim sufficient resource on the node,
it will begin evicting pods.
The `kubelet` ranks pods for eviction as follows:
* by their quality of service
* by the consumption of the starved compute resource relative to the pod's scheduling request.
As a result, pod eviction occurs in the following order:
* `BestEffort` pods that consume the most of the starved resource are failed
first.
* `Burstable` pods that consume the greatest amount of the starved resource
relative to their request for that resource are killed first. If no pod
has exceeded its request, the strategy targets the largest consumer of the
starved resource.
* `Guaranteed` pods that consume the greatest amount of the starved resource
relative to their request are killed first. If no pod has exceeded its request,
the strategy targets the largest consumer of the starved resource.
A `Guaranteed` pod is guaranteed to never be evicted because of another pod's
resource consumption. If a system daemon (i.e. `kubelet`, `docker`, `journald`, etc.)
is consuming more resources than were reserved via `system-reserved` or `kube-reserved` allocations,
and the node only has `Guaranteed` pod(s) remaining, then the node must choose to evict a
`Guaranteed` pod in order to preserve node stability, and to limit the impact
of the unexpected consumption to other `Guaranteed` pod(s).
Local disk is a `BestEffort` resource. If necessary, `kubelet` will evict pods one at a time to reclaim
disk when `DiskPressure` is encountered. The `kubelet` will rank pods by quality of service. If the `kubelet`
is responding to `inode` starvation, it will reclaim `inodes` by evicting pods with the lowest quality of service
first. If the `kubelet` is responding to a lack of available disk, it will rank pods within a quality of service
class by the amount of disk consumed and evict the largest consumers first.
#### With Imagefs
If `nodefs` is triggering evictions, `kubelet` will sort pods based on the usage on `nodefs`
- local volumes + logs of all its containers.
If `imagefs` is triggering evictions, `kubelet` will sort pods based on the writable layer usage of all its containers.
#### Without Imagefs
If `nodefs` is triggering evictions, `kubelet` will sort pods based on their total disk usage
- local volumes + logs & writable layer of all its containers.
### Minimum eviction reclaim
In certain scenarios, eviction of pods could result in reclamation of only a small amount of resources. This can result in
`kubelet` hitting eviction thresholds in repeated succession. In addition, reclaiming a resource like `disk`
is time consuming.
To mitigate these issues, `kubelet` can have a per-resource `minimum-reclaim`. Whenever `kubelet` observes
resource pressure, `kubelet` will attempt to reclaim at least `minimum-reclaim` amount of resource below
the configured eviction threshold.
For example, with the following configuration:
```
--eviction-hard=memory.available<500Mi,nodefs.available<1Gi,imagefs.available<100Gi
--eviction-minimum-reclaim="memory.available=0Mi,nodefs.available=500Mi,imagefs.available=2Gi"
```
If an eviction threshold is triggered for `memory.available`, the `kubelet` will work to ensure
that `memory.available` is at least `500Mi`. For `nodefs.available`, the `kubelet` will work
to ensure that `nodefs.available` is at least `1.5Gi`, and for `imagefs.available` it will
work to ensure that `imagefs.available` is at least `102Gi` before no longer reporting pressure
on their associated resources.
The default `eviction-minimum-reclaim` is `0` for all resources.
### Scheduler
The node will report a condition when a compute resource is under pressure. The
scheduler views that condition as a signal to dissuade placing additional
pods on the node.
| Node Condition | Scheduler Behavior |
| ---------------- | ------------------------------------------------ |
| `MemoryPressure` | No new `BestEffort` pods are scheduled to the node. |
| `DiskPressure` | No new pods are scheduled to the node. |
## Node OOM Behavior
If the node experiences a system OOM (out of memory) event before the `kubelet` is able to reclaim memory,
the node depends on the [oom_killer](https://lwn.net/Articles/391222/) to respond.
The `kubelet` sets an `oom_score_adj` value for each container based on the quality of service for the pod.
| Quality of Service | oom_score_adj |
|----------------------------|-----------------------------------------------------------------------|
| `Guaranteed` | -998 |
| `BestEffort` | 1000 |
| `Burstable` | min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999) |
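For example, a `Burstable` pod that requests 4Gi of memory on a node with 16Gi of capacity would receive an `oom_score_adj` of `min(max(2, 1000 - (1000 * 4Gi) / 16Gi), 999) = 750`.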
If the `kubelet` is unable to reclaim memory before the node experiences a system OOM, the `oom_killer` calculates
an `oom_score` based on the percentage of memory each container is using on the node, adds the container's `oom_score_adj` to get an
effective `oom_score`, and then kills the container with the highest score.
The intended behavior is that containers with the lowest quality of service that
consume the largest amount of memory relative to their scheduling request are killed first in order
to reclaim memory.
Unlike pod eviction, if a pod container is OOM killed, it may be restarted by the `kubelet` based on its `RestartPolicy`.
## Best Practices
### Schedulable resources and eviction policies
Let's imagine the following scenario:
* Node memory capacity: `10Gi`
* Operator wants to reserve 10% of memory capacity for system daemons (kernel, `kubelet`, etc.)
* Operator wants to evict pods at 95% memory utilization to reduce thrashing and incidence of system OOM.
To facilitate this scenario, the `kubelet` would be launched as follows:
```
--eviction-hard=memory.available<500Mi
--system-reserved=memory=1.5Gi
```
Implicit in this configuration is the understanding that "System reserved" should include the amount of memory
covered by the eviction threshold.
To reach that threshold, either some pod must be using more than its request, or the system must be using more than `500Mi`.
This configuration ensures that the scheduler does not place pods on a node in a way that immediately induces memory pressure
and triggers eviction, assuming those pods use less than their configured request.
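For example, if the pods on this node consume their full allocation of roughly 8.5Gi (the 10Gi capacity minus the 1.5Gi `system-reserved`) and system daemons stay within the 1Gi intended for them, `memory.available` should remain at or above the 500Mi eviction threshold.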
### DaemonSet
It is never desired for a `kubelet` to evict a pod that was derived from
a `DaemonSet` since the pod will immediately be recreated and rescheduled
back to the same node.
At the moment, the `kubelet` has no ability to distinguish a pod created
from a `DaemonSet` versus any other object. If/when that information is
available, the `kubelet` could pro-actively filter those pods from the
candidate set of pods provided to the eviction strategy.
In general, it is strongly recommended that a `DaemonSet` not
create `BestEffort` pods to avoid being identified as a candidate pod
for eviction. Instead, a `DaemonSet` should ideally launch `Guaranteed` pods, as in the sketch below.
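A minimal sketch of such a `DaemonSet`, assuming a hypothetical name and image; the requests equal the limits, so the resulting pods are `Guaranteed`:

```shell
$ cat <<EOF | kubectl create -f -
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: node-agent                      # hypothetical name
spec:
  template:
    metadata:
      labels:
        app: node-agent
    spec:
      containers:
      - name: agent
        image: example/node-agent:1.0   # hypothetical image
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 100m
            memory: 128Mi
EOF
```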
## Deprecation of existing feature flags to reclaim disk
The `kubelet` has been freeing up disk space on demand to keep the node stable.
As disk-based eviction matures, the following `kubelet` flags will be marked for deprecation
in favor of the simpler configuration supported around eviction.
| Existing Flag | New Flag |
| ------------- | -------- |
| `--image-gc-high-threshold` | `--eviction-hard` or `--eviction-soft` |
| `--image-gc-low-threshold` | `--eviction-minimum-reclaim` |
| `--maximum-dead-containers` | deprecated |
| `--maximum-dead-containers-per-container` | deprecated |
| `--minimum-container-ttl-duration` | deprecated |
| `--low-diskspace-threshold-mb` | `--eviction-hard` or `--eviction-soft` |
| `--outofdisk-transition-frequency` | `--eviction-pressure-transition-period` |
## Known issues
### kubelet may not observe memory pressure right away
The `kubelet` currently polls `cAdvisor` to collect memory usage stats at a regular interval. If memory usage
increases rapidly within that window, the `kubelet` may not observe `MemoryPressure` fast enough, and the `OOMKiller`
will still be invoked. We intend to integrate with the `memcg` notification API in a future release to reduce this
latency, so that the kernel tells us immediately when a threshold has been crossed.
If you are not trying to achieve extreme utilization, but a sensible measure of overcommit, a viable workaround for
this issue is to set eviction thresholds at approximately 75% capacity. This increases the ability of this feature
to prevent system OOMs and promotes eviction of workloads so that cluster state can rebalance.
### kubelet may evict more pods than needed
Pod eviction may evict more pods than needed because of a timing gap in stats collection. This can be mitigated in the future by adding
the ability to get root container stats on an on-demand basis (https://github.com/google/cadvisor/issues/1247).
### How kubelet ranks pods for eviction in response to inode exhaustion
At this time, it is not possible to know how many inodes were consumed by a particular container. If the `kubelet` observes
inode exhaustion, it will evict pods by ranking them by quality of service. The following issue has been opened in cAdvisor
to track per-container inode consumption (https://github.com/google/cadvisor/issues/1422), which would allow us to rank pods
by inode consumption. For example, this would let us identify a container that created large numbers of 0 byte files and evict
that pod over others.
[Configuring Out of Resource Handling](/docs/concepts/cluster-administration/out-of-resource/)

View File

@ -6,52 +6,6 @@ assignees:
title: Guaranteed Scheduling For Critical Add-On Pods
---
* TOC
{:toc}
{% include user-guide-content-moved.md %}
## Overview
In addition to Kubernetes core components like the api-server, scheduler, and controller-manager running on a master machine,
there are a number of add-ons which, for various reasons, must run on a regular cluster node (rather than the Kubernetes master).
Some of these add-ons are critical to a fully functional cluster, such as Heapster, DNS, and UI.
A cluster may stop working properly if a critical add-on is evicted (either manually or as a side effect of another operation like an upgrade)
and becomes pending (for example, when the cluster is highly utilized and either other pending pods schedule into the space
vacated by the evicted critical add-on pod or the amount of resources available on the node has changed for some other reason).
## Rescheduler: guaranteed scheduling of critical add-ons
Rescheduler ensures that critical add-ons are always scheduled
(assuming the cluster has enough resources to run the critical add-on pods in the absence of regular pods).
If the scheduler determines that no node has enough free resources to run the critical add-on pod,
given the pods that are already running in the cluster
(indicated by the critical add-on pod's condition `PodScheduled` set to false with the reason set to `Unschedulable`),
the rescheduler tries to free up space for the add-on by evicting some pods; then the scheduler will schedule the add-on pod.
To avoid a situation where another pod is scheduled into the space prepared for the critical add-on,
the chosen node gets a temporary taint "CriticalAddonsOnly" before the eviction(s)
(see [more details](https://github.com/kubernetes/kubernetes/blob/master/docs/design/taint-toleration-dedicated.md)).
Each critical add-on has to tolerate this taint,
while other pods shouldn't tolerate it. The taint is removed once the add-on is successfully scheduled.
*Warning:* currently there is no guarantee which node is chosen and which pods are killed
in order to schedule critical pods, so if the rescheduler is enabled your pods might occasionally be
killed for this purpose.
## Config
Rescheduler doesn't have any user-facing configuration (component config) or API.
It's enabled by default. It can be disabled:
* during cluster setup by setting `ENABLE_RESCHEDULER` flag to `false`
* on running cluster by deleting its manifest from master node
(default path `/etc/kubernetes/manifests/rescheduler.manifest`)
### Marking add-on as critical
To be critical, an add-on has to run in the `kube-system` namespace (configurable via flag)
and have the following annotations specified:
* `scheduler.alpha.kubernetes.io/critical-pod` set to empty string
* `scheduler.alpha.kubernetes.io/tolerations` set to `[{"key":"CriticalAddonsOnly", "operator":"Exists"}]`
The first one marks a pod as critical. The second one is required by the Rescheduler algorithm.
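A minimal sketch of a pod carrying both annotations, assuming a hypothetical name and image; real add-ons are usually managed by a higher-level controller, but the metadata is the same:

```shell
$ cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: critical-addon-example          # hypothetical name
  namespace: kube-system
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ''
    scheduler.alpha.kubernetes.io/tolerations: '[{"key":"CriticalAddonsOnly", "operator":"Exists"}]'
spec:
  containers:
  - name: addon
    image: example/addon:1.0            # hypothetical image
EOF
```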
[Guaranteed Scheduling for Critical Add-On Pods](/docs/concepts/cluster-administration/guaranteed-scheduling-critical-addon-pods/)

View File

@ -4,237 +4,6 @@ assignees:
title: Resource Quotas
---
When several users or teams share a cluster with a fixed number of nodes,
there is a concern that one team could use more than its fair share of resources.
{% include user-guide-content-moved.md %}
Resource quotas are a tool for administrators to address this concern.
A resource quota, defined by a `ResourceQuota` object, provides constraints that limit
aggregate resource consumption per namespace. It can limit the quantity of objects that can
be created in a namespace by type, as well as the total amount of compute resources that may
be consumed by resources in that namespace.
Resource quotas work like this:
- Different teams work in different namespaces. Currently this is voluntary, but
support for making this mandatory via ACLs is planned.
- The administrator creates one or more Resource Quota objects for each namespace.
- Users create resources (pods, services, etc.) in the namespace, and the quota system
tracks usage to ensure it does not exceed hard resource limits defined in a Resource Quota.
- If creating or updating a resource violates a quota constraint, the request will fail with HTTP
status code `403 FORBIDDEN` with a message explaining the constraint that would have been violated.
- If quota is enabled in a namespace for compute resources like `cpu` and `memory`, users must specify
requests or limits for those values; otherwise, the quota system may reject pod creation. Hint: Use
the LimitRange admission controller to force defaults for pods that make no compute resource requirements.
See the [walkthrough](/docs/admin/resourcequota/walkthrough/) for an example to avoid this problem.
Examples of policies that could be created using namespaces and quotas are:
- In a cluster with a capacity of 32 GiB RAM and 16 cores, let team A use 20 GiB and 10 cores,
  let team B use 10 GiB and 4 cores, and hold 2 GiB and 2 cores in reserve for future allocation.
- Limit the "testing" namespace to using 1 core and 1 GiB RAM. Let the "production" namespace
  use any amount.
In the case where the total capacity of the cluster is less than the sum of the quotas of the namespaces,
there may be contention for resources. This is handled on a first-come-first-served basis.
Neither contention nor changes to quota will affect already created resources.
## Enabling Resource Quota
Resource Quota support is enabled by default for many Kubernetes distributions. It is
enabled when the apiserver `--admission-control=` flag has `ResourceQuota` as
one of its arguments.
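For example, an apiserver invocation might include the flag as follows; the exact list of admission controllers varies by cluster and version, so treat this only as a sketch:

```shell
# Sketch only: append your cluster's other flags in place of '...'
$ kube-apiserver --admission-control=NamespaceLifecycle,LimitRanger,ResourceQuota ...
```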
Resource Quota is enforced in a particular namespace when there is a
`ResourceQuota` object in that namespace. There should be at most one
`ResourceQuota` object in a namespace.
## Compute Resource Quota
You can limit the total sum of [compute resources](/docs/user-guide/compute-resources) that can be requested in a given namespace.
The following resource types are supported:
| Resource Name | Description |
| --------------------- | ----------------------------------------------------------- |
| `cpu` | Across all pods in a non-terminal state, the sum of CPU requests cannot exceed this value. |
| `limits.cpu` | Across all pods in a non-terminal state, the sum of CPU limits cannot exceed this value. |
| `limits.memory` | Across all pods in a non-terminal state, the sum of memory limits cannot exceed this value. |
| `memory` | Across all pods in a non-terminal state, the sum of memory requests cannot exceed this value. |
| `requests.cpu` | Across all pods in a non-terminal state, the sum of CPU requests cannot exceed this value. |
| `requests.memory` | Across all pods in a non-terminal state, the sum of memory requests cannot exceed this value. |
## Storage Resource Quota
You can limit the total sum of [storage resources](/docs/user-guide/persistent-volumes) that can be requested in a given namespace.
In addition, you can limit consumption of storage resources based on associated storage-class.
| Resource Name | Description |
| --------------------- | ----------------------------------------------------------- |
| `requests.storage` | Across all persistent volume claims, the sum of storage requests cannot exceed this value. |
| `persistentvolumeclaims` | The total number of [persistent volume claims](/docs/user-guide/persistent-volumes/#persistentvolumeclaims) that can exist in the namespace. |
| `<storage-class-name>.storageclass.storage.k8s.io/requests.storage` | Across all persistent volume claims associated with the storage-class-name, the sum of storage requests cannot exceed this value. |
| `<storage-class-name>.storageclass.storage.k8s.io/persistentvolumeclaims` | Across all persistent volume claims associated with the storage-class-name, the total number of [persistent volume claims](/docs/user-guide/persistent-volumes/#persistentvolumeclaims) that can exist in the namespace. |
For example, if an operator wants to quota storage with `gold` storage class separate from `bronze` storage class, the operator can
define a quota as follows:
* `gold.storageclass.storage.k8s.io/requests.storage: 500Gi`
* `bronze.storageclass.storage.k8s.io/requests.storage: 100Gi`
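A hedged sketch of a `ResourceQuota` manifest expressing those limits, using a hypothetical object name and namespace placeholder:

```shell
$ cat <<EOF > storage-class-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: storage-class-quota             # hypothetical name
spec:
  hard:
    gold.storageclass.storage.k8s.io/requests.storage: 500Gi
    bronze.storageclass.storage.k8s.io/requests.storage: 100Gi
EOF
$ kubectl create -f ./storage-class-quota.yaml --namespace=<your-namespace>
```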
## Object Count Quota
The number of objects of a given type can be restricted. The following types
are supported:
| Resource Name | Description |
| ------------------------------- | ------------------------------------------------- |
| `configmaps` | The total number of config maps that can exist in the namespace. |
| `persistentvolumeclaims` | The total number of [persistent volume claims](/docs/user-guide/persistent-volumes/#persistentvolumeclaims) that can exist in the namespace. |
| `pods` | The total number of pods in a non-terminal state that can exist in the namespace. A pod is in a terminal state if `status.phase in (Failed, Succeeded)` is true. |
| `replicationcontrollers` | The total number of replication controllers that can exist in the namespace. |
| `resourcequotas` | The total number of [resource quotas](/docs/admin/admission-controllers/#resourcequota) that can exist in the namespace. |
| `services` | The total number of services that can exist in the namespace. |
| `services.loadbalancers` | The total number of services of type load balancer that can exist in the namespace. |
| `services.nodeports` | The total number of services of type node port that can exist in the namespace. |
| `secrets` | The total number of secrets that can exist in the namespace. |
For example, `pods` quota counts and enforces a maximum on the number of `pods`
created in a single namespace.
You might want to set a pods quota on a namespace
to avoid the case where a user creates many small pods and exhausts the cluster's
supply of Pod IPs.
## Quota Scopes
Each quota can have an associated set of scopes. A quota will only measure usage for a resource if it matches
the intersection of enumerated scopes.
When a scope is added to the quota, it limits the number of resources it supports to those that pertain to the scope.
Resources specified on the quota outside of the allowed set result in a validation error.
| Scope | Description |
| ----- | ----------- |
| `Terminating` | Match pods where `spec.activeDeadlineSeconds >= 0` |
| `NotTerminating` | Match pods where `spec.activeDeadlineSeconds is nil` |
| `BestEffort` | Match pods that have best effort quality of service. |
| `NotBestEffort` | Match pods that do not have best effort quality of service. |
The `BestEffort` scope restricts a quota to tracking the following resource: `pods`
The `Terminating`, `NotTerminating`, and `NotBestEffort` scopes restrict a quota to tracking the following resources:
* `cpu`
* `limits.cpu`
* `limits.memory`
* `memory`
* `pods`
* `requests.cpu`
* `requests.memory`
## Requests vs Limits
When allocating compute resources, each container may specify a request and a limit value for either CPU or memory.
The quota can be configured to quota either value.
If the quota has a value specified for `requests.cpu` or `requests.memory`, then it requires that every incoming
container makes an explicit request for those resources. If the quota has a value specified for `limits.cpu` or `limits.memory`,
then it requires that every incoming container specifies an explicit limit for those resources.
## Viewing and Setting Quotas
Kubectl supports creating, updating, and viewing quotas:
```shell
$ kubectl create namespace myspace
$ cat <<EOF > compute-resources.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
name: compute-resources
spec:
hard:
pods: "4"
requests.cpu: "1"
requests.memory: 1Gi
limits.cpu: "2"
limits.memory: 2Gi
EOF
$ kubectl create -f ./compute-resources.yaml --namespace=myspace
$ cat <<EOF > object-counts.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
name: object-counts
spec:
hard:
configmaps: "10"
persistentvolumeclaims: "4"
replicationcontrollers: "20"
secrets: "10"
services: "10"
services.loadbalancers: "2"
EOF
$ kubectl create -f ./object-counts.yaml --namespace=myspace
$ kubectl get quota --namespace=myspace
NAME AGE
compute-resources 30s
object-counts 32s
$ kubectl describe quota compute-resources --namespace=myspace
Name: compute-resources
Namespace: myspace
Resource Used Hard
-------- ---- ----
limits.cpu 0 2
limits.memory 0 2Gi
pods 0 4
requests.cpu 0 1
requests.memory 0 1Gi
$ kubectl describe quota object-counts --namespace=myspace
Name: object-counts
Namespace: myspace
Resource Used Hard
-------- ---- ----
configmaps 0 10
persistentvolumeclaims 0 4
replicationcontrollers 0 20
secrets 1 10
services 0 10
services.loadbalancers 0 2
```
## Quota and Cluster Capacity
Resource Quota objects are independent of the Cluster Capacity. They are
expressed in absolute units. So, if you add nodes to your cluster, this does *not*
automatically give each namespace the ability to consume more resources.
Sometimes more complex policies may be desired, such as:
- proportionally divide total cluster resources among several teams.
- allow each tenant to grow resource usage as needed, but have a generous
limit to prevent accidental resource exhaustion.
- detect demand from one namespace, add nodes, and increase quota.
Such policies could be implemented using ResourceQuota as a building-block, by
writing a 'controller' which watches the quota usage and adjusts the quota
hard limits of each namespace according to other signals.
Note that resource quota divides up aggregate cluster resources, but it creates no
restrictions around nodes: pods from several namespaces may run on the same node.
## Example
See a [detailed example for how to use resource quota](/docs/admin/resourcequota/walkthrough/).
## Read More
See [ResourceQuota design doc](https://github.com/kubernetes/kubernetes/blob/{{page.githubbranch}}/docs/design/admission_control_resource_quota.md) for more information.
[Resource Quotas](/docs/concepts/policy/resource-quotas/)

View File

@ -4,75 +4,8 @@ assignees:
- janetkuo
title: Limiting Storage Consumption
---
This example demonstrates an easy way to limit the amount of storage consumed in a namespace.
The following resources are used in the demonstration:
{% include user-guide-content-moved.md %}
* [Resource Quota](/docs/admin/resourcequota/)
* [Limit Range](/docs/admin/limitrange/)
* [Persistent Volume Claim](/docs/user-guide/persistent-volumes/)
[Limiting Storage Consumption](/docs/tasks/administer-cluster/limit-storage-consumption/)
This example assumes you have a functional Kubernetes setup.
## Limiting Storage Consumption
The cluster-admin is operating a cluster on behalf of a user population and the admin wants to control
how much storage a single namespace can consume in order to control cost.
The admin would like to limit:
1. The number of persistent volume claims in a namespace
2. The amount of storage each claim can request
3. The amount of cumulative storage the namespace can have
## LimitRange to limit requests for storage
Adding a `LimitRange` to a namespace enforces minimum and maximum storage request sizes. Storage is requested
via `PersistentVolumeClaim`. The admission controller that enforces limit ranges will reject any PVC that is above or below
the values set by the admin.
In this example, a PVC requesting 10Gi of storage would be rejected because it exceeds the 2Gi max.
```
apiVersion: v1
kind: LimitRange
metadata:
name: storagelimits
spec:
limits:
- type: PersistentVolumeClaim
max:
storage: 2Gi
min:
storage: 1Gi
```
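As a sketch of the rejection described above (the claim name is a placeholder), a 10Gi claim submitted against this `LimitRange` would be denied by the admission controller:

```shell
$ cat <<EOF | kubectl create -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: oversized-claim                 # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
EOF
```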
Minimum storage requests are used when the underlying storage provider requires certain minimums. For example,
AWS EBS volumes have a 1Gi minimum requirement.
## StorageQuota to limit PVC count and cumulative storage capacity
Admins can limit the number of PVCs in a namespace as well as the cumulative capacity of those PVCs. New PVCs that exceed
either maximum value will be rejected.
In this example, a 6th PVC in the namespace would be rejected because it exceeds the maximum count of 5. Alternatively,
a 5Gi maximum quota, combined with the 2Gi max limit above, means the namespace cannot hold 3 PVCs of 2Gi each: that would be 6Gi requested
in a namespace capped at 5Gi.
```
apiVersion: v1
kind: ResourceQuota
metadata:
name: storagequota
spec:
hard:
persistentvolumeclaims: "5"
requests.storage: "5Gi"
```
## Summary
A limit range can put a ceiling on how much storage is requested, while a resource quota can effectively cap the storage
consumed by a namespace through claim counts and cumulative storage capacity. This allows a cluster-admin to plan their
cluster's storage budget without the risk of any one project going over its allotment.

View File

@ -5,362 +5,6 @@ assignees:
title: Applying Resource Quotas and Limits
---
This example demonstrates a typical setup to control for resource usage in a namespace.
{% include user-guide-content-moved.md %}
It demonstrates using the following resources:
* [Namespace](/docs/admin/namespaces)
* [Resource Quota](/docs/admin/resourcequota/)
* [Limit Range](/docs/admin/limitrange/)
This example assumes you have a functional Kubernetes setup.
## Scenario
The cluster-admin is operating a cluster on behalf of a user population and the cluster-admin
wants to control the amount of resources that can be consumed in a particular namespace to promote
fair sharing of the cluster and control cost.
The cluster-admin has the following goals:
* Limit the amount of compute resource for running pods
* Limit the number of persistent volume claims to control access to storage
* Limit the number of load balancers to control cost
* Prevent the use of node ports to preserve scarce resources
* Provide default compute resource requests to enable better scheduling decisions
## Step 1: Create a namespace
This example will work in a custom namespace to demonstrate the concepts involved.
Let's create a new namespace called quota-example:
```shell
$ kubectl create -f docs/admin/resourcequota/namespace.yaml
namespace "quota-example" created
$ kubectl get namespaces
NAME STATUS AGE
default Active 2m
kube-system Active 2m
quota-example Active 39s
```
## Step 2: Apply an object-count quota to the namespace
The cluster-admin wants to control the following resources:
* persistent volume claims
* load balancers
* node ports
Let's create a simple quota that controls object counts for those resource types in this namespace.
```shell
$ kubectl create -f docs/admin/resourcequota/object-counts.yaml --namespace=quota-example
resourcequota "object-counts" created
```
The quota system will observe that a quota has been created, and will calculate consumption
in the namespace in response. This should happen quickly.
Let's describe the quota to see what is currently being consumed in this namespace:
```shell
$ kubectl describe quota object-counts --namespace=quota-example
Name: object-counts
Namespace: quota-example
Resource Used Hard
-------- ---- ----
persistentvolumeclaims 0 2
services.loadbalancers 0 2
services.nodeports 0 0
```
The quota system will now prevent users from creating more than the specified amount for each resource.
## Step 3: Apply a compute-resource quota to the namespace
To limit the amount of compute resource that can be consumed in this namespace,
let's create a quota that tracks compute resources.
```shell
$ kubectl create -f docs/admin/resourcequota/compute-resources.yaml --namespace=quota-example
resourcequota "compute-resources" created
```
Let's describe the quota to see what is currently being consumed in this namespace:
```shell
$ kubectl describe quota compute-resources --namespace=quota-example
Name: compute-resources
Namespace: quota-example
Resource Used Hard
-------- ---- ----
limits.cpu 0 2
limits.memory 0 2Gi
pods 0 4
requests.cpu 0 1
requests.memory 0 1Gi
```
The quota system will now prevent the namespace from having more than 4 non-terminal pods. In
addition, it will enforce that each container in a pod makes a `request` and defines a `limit` for
`cpu` and `memory`.
## Step 4: Applying default resource requests and limits
Pod authors rarely specify resource requests and limits for their pods.
Since we applied a quota to our namespace, let's see what happens when an end-user creates a pod that has unbounded
cpu and memory by creating an nginx container.
To demonstrate, let's create a deployment that runs nginx:
```shell
$ kubectl run nginx --image=nginx --replicas=1 --namespace=quota-example
deployment "nginx" created
```
Now let's look at the pods that were created.
```shell
$ kubectl get pods --namespace=quota-example
```
What happened? I have no pods! Let's describe the deployment to get a view of what is happening.
```shell
$ kubectl describe deployment nginx --namespace=quota-example
Name: nginx
Namespace: quota-example
CreationTimestamp: Mon, 06 Jun 2016 16:11:37 -0400
Labels: run=nginx
Selector: run=nginx
Replicas: 0 updated | 1 total | 0 available | 1 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 1 max unavailable, 1 max surge
OldReplicaSets: <none>
NewReplicaSet: nginx-3137573019 (0/1 replicas created)
...
```
A deployment created a corresponding replica set and attempted to size it to create a single pod.
Let's look at the replica set to get more detail.
```shell
$ kubectl describe rs nginx-3137573019 --namespace=quota-example
Name: nginx-3137573019
Namespace: quota-example
Image(s): nginx
Selector: pod-template-hash=3137573019,run=nginx
Labels: pod-template-hash=3137573019
run=nginx
Replicas: 0 current / 1 desired
Pods Status: 0 Running / 0 Waiting / 0 Succeeded / 0 Failed
No volumes.
Events:
FirstSeen LastSeen Count From SubobjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
4m 7s 11 {replicaset-controller } Warning FailedCreate Error creating: pods "nginx-3137573019-" is forbidden: Failed quota: compute-resources: must specify limits.cpu,limits.memory,requests.cpu,requests.memory
```
The Kubernetes API server is rejecting the replica set requests to create a pod because our pods
do not specify `requests` or `limits` for `cpu` and `memory`.
So let's set some default values for the amount of `cpu` and `memory` a pod can consume:
```shell
$ kubectl create -f docs/admin/resourcequota/limits.yaml --namespace=quota-example
limitrange "limits" created
$ kubectl describe limits limits --namespace=quota-example
Name: limits
Namespace: quota-example
Type Resource Min Max Default Request Default Limit Max Limit/Request Ratio
---- -------- --- --- --------------- ------------- -----------------------
Container memory - - 256Mi 512Mi -
Container cpu - - 100m 200m -
```
If the Kubernetes API server observes a request to create a pod in this namespace, and the containers
in that pod do not make any compute resource requests, a default request and default limit will be applied
as part of admission control.
In this example, each pod created will have compute resources equivalent to the following:
```shell
$ kubectl run nginx \
--image=nginx \
--replicas=1 \
--requests=cpu=100m,memory=256Mi \
--limits=cpu=200m,memory=512Mi \
--namespace=quota-example
```
Now that we have applied default compute resources for our namespace, our replica set should be able to create
its pods.
```shell
$ kubectl get pods --namespace=quota-example
NAME READY STATUS RESTARTS AGE
nginx-3137573019-fvrig 1/1 Running 0 6m
```
And if we print out our quota usage in the namespace:
```shell
$ kubectl describe quota --namespace=quota-example
Name: compute-resources
Namespace: quota-example
Resource Used Hard
-------- ---- ----
limits.cpu 200m 2
limits.memory 512Mi 2Gi
pods 1 4
requests.cpu 100m 1
requests.memory 256Mi 1Gi
Name: object-counts
Namespace: quota-example
Resource Used Hard
-------- ---- ----
persistentvolumeclaims 0 2
services.loadbalancers 0 2
services.nodeports 0 0
```
As you can see, the pod that was created is consuming explicit amounts of compute resources, and the usage is being
tracked by Kubernetes properly.
## Step 5: Advanced quota scopes
Let's imagine you did not want to specify default compute resource consumption in your namespace.
Instead, you want to let users run a specific number of `BestEffort` pods in their namespace to take
advantage of slack compute resources, and then require that users make an explicit resource request for
pods that require a higher quality of service.
Let's create a new namespace with two quotas to demonstrate this behavior:
```shell
$ kubectl create namespace quota-scopes
namespace "quota-scopes" created
$ kubectl create -f docs/admin/resourcequota/best-effort.yaml --namespace=quota-scopes
resourcequota "best-effort" created
$ kubectl create -f docs/admin/resourcequota/not-best-effort.yaml --namespace=quota-scopes
resourcequota "not-best-effort" created
$ kubectl describe quota --namespace=quota-scopes
Name: best-effort
Namespace: quota-scopes
Scopes: BestEffort
* Matches all pods that have best effort quality of service.
Resource Used Hard
-------- ---- ----
pods 0 10
Name: not-best-effort
Namespace: quota-scopes
Scopes: NotBestEffort
* Matches all pods that do not have best effort quality of service.
Resource Used Hard
-------- ---- ----
limits.cpu 0 2
limits.memory 0 2Gi
pods 0 4
requests.cpu 0 1
requests.memory 0 1Gi
```
In this scenario, a pod that makes no compute resource requests will be tracked by the `best-effort` quota.
A pod that does make compute resource requests will be tracked by the `not-best-effort` quota.
Let's demonstrate this by creating two deployments:
```shell
$ kubectl run best-effort-nginx --image=nginx --replicas=8 --namespace=quota-scopes
deployment "best-effort-nginx" created
$ kubectl run not-best-effort-nginx \
--image=nginx \
--replicas=2 \
--requests=cpu=100m,memory=256Mi \
--limits=cpu=200m,memory=512Mi \
--namespace=quota-scopes
deployment "not-best-effort-nginx" created
```
Even though no default limits were specified, the `best-effort-nginx` deployment will create
all 8 pods. This is because it is tracked by the `best-effort` quota, and the `not-best-effort`
quota will just ignore it. The `not-best-effort` quota will track the `not-best-effort-nginx`
deployment since it creates pods with `Burstable` quality of service.
Let's list the pods in the namespace:
```shell
$ kubectl get pods --namespace=quota-scopes
NAME READY STATUS RESTARTS AGE
best-effort-nginx-3488455095-2qb41 1/1 Running 0 51s
best-effort-nginx-3488455095-3go7n 1/1 Running 0 51s
best-effort-nginx-3488455095-9o2xg 1/1 Running 0 51s
best-effort-nginx-3488455095-eyg40 1/1 Running 0 51s
best-effort-nginx-3488455095-gcs3v 1/1 Running 0 51s
best-effort-nginx-3488455095-rq8p1 1/1 Running 0 51s
best-effort-nginx-3488455095-udhhd 1/1 Running 0 51s
best-effort-nginx-3488455095-zmk12 1/1 Running 0 51s
not-best-effort-nginx-2204666826-7sl61 1/1 Running 0 23s
not-best-effort-nginx-2204666826-ke746 1/1 Running 0 23s
```
As you can see, all 10 pods have been allowed to be created.
Let's describe current quota usage in the namespace:
```shell
$ kubectl describe quota --namespace=quota-scopes
Name: best-effort
Namespace: quota-scopes
Scopes: BestEffort
* Matches all pods that have best effort quality of service.
Resource Used Hard
-------- ---- ----
pods 8 10
Name: not-best-effort
Namespace: quota-scopes
Scopes: NotBestEffort
* Matches all pods that do not have best effort quality of service.
Resource Used Hard
-------- ---- ----
limits.cpu 400m 2
limits.memory 1Gi 2Gi
pods 2 4
requests.cpu 200m 1
requests.memory 512Mi 1Gi
```
As you can see, the `best-effort` quota has tracked the usage for the 8 pods we created in
the `best-effort-nginx` deployment, and the `not-best-effort` quota has tracked the usage for
the 2 pods we created in the `not-best-effort-nginx` deployment.
Scopes provide a mechanism to subdivide the set of resources that are tracked by
any quota document to allow greater flexibility in how operators deploy and track resource
consumption.
In addition to `BestEffort` and `NotBestEffort` scopes, there are scopes to restrict
long-running versus time-bound pods. The `Terminating` scope will match any pod
where `spec.activeDeadlineSeconds is not nil`. The `NotTerminating` scope will match any pod
where `spec.activeDeadlineSeconds is nil`. These scopes allow you to quota pods based on their
anticipated permanence on a node in your cluster.
## Summary
Actions that consume node resources for cpu and memory can be subject to hard quota limits defined by the namespace quota.
Any action that consumes those resources can be tweaked, or can pick up namespace level defaults to meet your end goal.
Quota can be apportioned based on quality of service and anticipated permanence on a node in your cluster.
[Applying Resource Quotas and Limits](/docs/tasks/configure-pod-container/apply-resource-quota-limit/)

View File

@ -4,125 +4,6 @@ assignees:
title: Static Pods
---
**If you are running clustered Kubernetes and are using static pods to run a pod on every node, you should probably be using a [DaemonSet](/docs/admin/daemons/)!**
{% include user-guide-content-moved.md %}
*Static pods* are managed directly by the kubelet daemon on a specific node, without the API server observing them. Static pods are not associated with any replication controller; the kubelet daemon itself watches each static pod and restarts it when it crashes. There is no health check, though. Static pods are always bound to one kubelet daemon and always run on the same node as it.
The kubelet automatically creates a so-called *mirror pod* on the Kubernetes API server for each static pod, so the pods are visible there, but they cannot be controlled from the API server.
## Static pod creation
Static pods can be created in two ways: either by using configuration file(s) or by HTTP.
### Configuration files
The configuration files are just standard pod definitions in JSON or YAML format in a specific directory. Use `kubelet --pod-manifest-path=<the directory>` to start the kubelet daemon, which periodically scans the directory and creates/deletes static pods as yaml/json files appear/disappear there.
For example, this is how to start a simple web server as a static pod:
1. Choose a node where we want to run the static pod. In this example, it's `my-node1`.
```
[joe@host ~] $ ssh my-node1
```
2. Choose a directory, say `/etc/kubelet.d` and place a web server pod definition there, e.g. `/etc/kubelet.d/static-web.yaml`:
```
[root@my-node1 ~] $ mkdir /etc/kubelet.d/
[root@my-node1 ~] $ cat <<EOF >/etc/kubelet.d/static-web.yaml
apiVersion: v1
kind: Pod
metadata:
name: static-web
labels:
role: myrole
spec:
containers:
- name: web
image: nginx
ports:
- name: web
containerPort: 80
protocol: TCP
EOF
```
3. Configure your kubelet daemon on the node to use this directory by running it with `--pod-manifest-path=/etc/kubelet.d/` argument.
On Fedora edit `/etc/kubernetes/kubelet` to include this line:
```
KUBELET_ARGS="--cluster-dns=10.254.0.10 --cluster-domain=kube.local --pod-manifest-path=/etc/kubelet.d/"
```
Instructions for other distributions or Kubernetes installations may vary.
4. Restart kubelet. On Fedora, this is:
```
[root@my-node1 ~] $ systemctl restart kubelet
```
## Pods created via HTTP
The kubelet periodically downloads a file specified by the `--manifest-url=<URL>` argument and interprets it as a JSON/YAML file with a pod definition. It works the same as `--pod-manifest-path=<directory>`, i.e. it's reloaded every now and then and changes are applied to running static pods (see below).
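For example (the URL here is only a placeholder), the kubelet might be started with:

```shell
# Sketch only: append your node's other kubelet flags in place of '...'
$ kubelet --manifest-url=http://my-manifest-server.example/static-web.yaml ...
```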
## Behavior of static pods
When the kubelet starts, it automatically starts all pods defined in the directory specified by the `--pod-manifest-path=` or `--manifest-url=` arguments, i.e. our static-web pod. (It may take some time to pull the nginx image, so be patient…):
```shell
[joe@my-node1 ~] $ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
f6d05272b57e nginx:latest "nginx" 8 minutes ago Up 8 minutes k8s_web.6f802af4_static-web-fk-node1_default_67e24ed9466ba55986d120c867395f3c_378e5f3c
```
If we look at our Kubernetes API server (running on host `my-master`), we see that a new mirror-pod was created there too:
```shell
[joe@host ~] $ ssh my-master
[joe@my-master ~] $ kubectl get pods
NAME READY STATUS RESTARTS AGE
static-web-my-node1 1/1 Running 0 2m
```
Labels from the static pod are propagated into the mirror-pod and can be used as usual for filtering.
Notice that we cannot delete the pod via the API server (e.g. using the [`kubectl`](/docs/user-guide/kubectl/) command); the kubelet simply won't remove it.
```shell
[joe@my-master ~] $ kubectl delete pod static-web-my-node1
pods/static-web-my-node1
[joe@my-master ~] $ kubectl get pods
NAME READY STATUS RESTARTS AGE
static-web-my-node1 1/1 Running 0 12s
```
Back on our `my-node1` host, we can try to stop the container manually and see that the kubelet automatically restarts it after a while:
```shell
[joe@host ~] $ ssh my-node1
[joe@my-node1 ~] $ docker stop f6d05272b57e
[joe@my-node1 ~] $ sleep 20
[joe@my-node1 ~] $ docker ps
CONTAINER ID IMAGE COMMAND CREATED ...
5b920cbaf8b1 nginx:latest "nginx -g 'daemon of 2 seconds ago ...
```
## Dynamic addition and removal of static pods
Running kubelet periodically scans the configured directory (`/etc/kubelet.d` in our example) for changes and adds/removes pods as files appear/disappear in this directory.
```shell
[joe@my-node1 ~] $ mv /etc/kubelet.d/static-web.yaml /tmp
[joe@my-node1 ~] $ sleep 20
[joe@my-node1 ~] $ docker ps
// no nginx container is running
[joe@my-node1 ~] $ mv /tmp/static-web.yaml /etc/kubelet.d/
[joe@my-node1 ~] $ sleep 20
[joe@my-node1 ~] $ docker ps
CONTAINER ID IMAGE COMMAND CREATED ...
e7a62e3427f1 nginx:latest "nginx -g 'daemon of 27 seconds ago
```
[Static Pods](/docs/concepts/cluster-administration/static-pod/)

View File

@ -4,119 +4,6 @@ assignees:
title: Using Sysctls in a Kubernetes Cluster
---
* TOC
{:toc}
{% include user-guide-content-moved.md %}
This document describes how sysctls are used within a Kubernetes cluster.
## What is a Sysctl?
In Linux, the sysctl interface allows an administrator to modify kernel
parameters at runtime. Parameters are available via the `/proc/sys/` virtual
process file system. The parameters cover various subsystems such as:
- kernel (common prefix: `kernel.`)
- networking (common prefix: `net.`)
- virtual memory (common prefix: `vm.`)
- MDADM (common prefix: `dev.`)
- More subsystems are described in [Kernel docs](https://www.kernel.org/doc/Documentation/sysctl/README).
To get a list of all parameters, you can run
```
$ sudo sysctl -a
```
## Namespaced vs. Node-Level Sysctls
A number of sysctls are _namespaced_ in today's Linux kernels. This means that
they can be set independently for each pod on a node. Being namespaced is a
requirement for sysctls to be accessible in a pod context within Kubernetes.
The following sysctls are known to be _namespaced_:
- `kernel.shm*`,
- `kernel.msg*`,
- `kernel.sem`,
- `fs.mqueue.*`,
- `net.*`.
Sysctls which are not namespaced are called _node-level_ and must be set
manually by the cluster admin, either by means of the underlying Linux
distribution of the nodes (e.g. via `/etc/sysctl.conf`) or using a DaemonSet
with privileged containers.
**Note**: it is good practice to consider nodes with special sysctl settings as
_tainted_ within a cluster, and only schedule pods onto them which need those
sysctl settings. It is suggested to use the Kubernetes [_taints and toleration_
feature](/docs/user-guide/kubectl/kubectl_taint.md) to implement this.
## Safe vs. Unsafe Sysctls
Sysctls are grouped into _safe_ and _unsafe_ sysctls. In addition to proper
namespacing, a _safe_ sysctl must be properly _isolated_ between pods on the same
node. This means that setting a _safe_ sysctl for one pod
- must not have any influence on any other pod on the node
- must not allow harming the node's health
- must not allow gaining CPU or memory resources outside of the resource limits
  of a pod.
By far, most of the _namespaced_ sysctls are not necessarily considered _safe_.
For Kubernetes 1.4, the following sysctls are supported in the _safe_ set:
- `kernel.shm_rmid_forced`,
- `net.ipv4.ip_local_port_range`,
- `net.ipv4.tcp_syncookies`.
This list will be extended in future Kubernetes versions when the kubelet
supports better isolation mechanisms.
All _safe_ sysctls are enabled by default.
All _unsafe_ sysctls are disabled by default and must be allowed manually by the
cluster admin on a per-node basis. Pods with disabled unsafe sysctls will be
scheduled, but will fail to launch.
**Warning**: Due to their nature of being _unsafe_, the use of _unsafe_ sysctls
is at-your-own-risk and can lead to severe problems like wrong behavior of
containers, resource shortage or complete breakage of a node.
## Enabling Unsafe Sysctls
With the warning above in mind, the cluster admin can allow certain _unsafe_
sysctls for very special situations like e.g. high-performance or real-time
application tuning. _Unsafe_ sysctls are enabled on a node-by-node basis with a
flag of the kubelet, e.g.:
```shell
$ kubelet --experimental-allowed-unsafe-sysctls 'kernel.msg*,net.ipv4.route.min_pmtu' ...
```
Only _namespaced_ sysctls can be enabled this way.
## Setting Sysctls for a Pod
The sysctl feature is an alpha API in Kubernetes 1.4. Therefore, sysctls are set
using annotations on pods. They apply to all containers in the same pod.
Here is an example, with different annotations for _safe_ and _unsafe_ sysctls:
```yaml
apiVersion: v1
kind: Pod
metadata:
name: sysctl-example
annotations:
security.alpha.kubernetes.io/sysctls: kernel.shm_rmid_forced=1
security.alpha.kubernetes.io/unsafe-sysctls: net.ipv4.route.min_pmtu=1000,kernel.msgmax=1 2 3
spec:
...
```
**Note**: a pod with the _unsafe_ sysctls specified above will fail to launch on
any node which has not enabled those two _unsafe_ sysctls explicitly. As with
_node-level_ sysctls it is recommended to use the [_taints and toleration_
feature](/docs/user-guide/kubectl/kubectl_taint.md) or [labels on nodes](/docs/user-guide/labels.md)
to schedule those pods onto the right nodes.
[Using Sysctls in a Kubernetes Cluster](/docs/concepts/cluster-administration/sysctl-cluster/)

26
docs/admin/upgrade-1-6.md Normal file
View File

@ -0,0 +1,26 @@
---
assignees:
- mml
title: Cluster Management Guide
---
* TOC
{:toc}
This document outlines the potentially disruptive changes that exist in the 1.6 release cycle. Operators, administrators, and developers should
take note of the changes below in order to maintain continuity across their upgrade process.
## Cluster defaults set to etcd 3
In the 1.6 release cycle, the default backend storage layer has been upgraded to fully leverage [etcd 3 capabilities](https://coreos.com/blog/etcd3-a-new-etcd.html).
For new clusters, there is nothing an operator needs to do; it should "just work". However, if you are upgrading from a 1.5 cluster, care should be taken to ensure
continuity.
It is possible to maintain v2 compatibility mode while running etcd 3 for an interim period of time. To do this, you will simply need to update an argument passed to your apiserver during
startup:
```
$ kube-apiserver --storage-backend='etcd2' $(EXISTING_ARGS)
```
However, for long-term maintenance of the cluster, we recommend that the operator plan an outage window in order to perform a [v2->v3 data upgrade](https://coreos.com/etcd/docs/latest/upgrades/upgrade_3_0.html).

View File

@ -6,105 +6,4 @@ assignees:
title: Kubernetes API Overview
---
Primary system and API concepts are documented in the [User guide](/docs/user-guide/).
Overall API conventions are described in the [API conventions doc](https://github.com/kubernetes/kubernetes/tree/{{page.githubbranch}}/docs/devel/api-conventions.md).
Remote access to the API is discussed in the [access doc](/docs/admin/accessing-the-api).
The Kubernetes API also serves as the foundation for the declarative configuration schema for the system. The [Kubectl](/docs/user-guide/kubectl) command-line tool can be used to create, update, delete, and get API objects.
Kubernetes also stores its serialized state (currently in [etcd](https://coreos.com/docs/distributed-configuration/getting-started-with-etcd/)) in terms of the API resources.
Kubernetes itself is decomposed into multiple components, which interact through its API.
## API changes
In our experience, any system that is successful needs to grow and change as new use cases emerge or existing ones change. Therefore, we expect the Kubernetes API to continuously change and grow. However, we intend to not break compatibility with existing clients, for an extended period of time. In general, new API resources and new resource fields can be expected to be added frequently. Elimination of resources or fields will require following a deprecation process. The precise deprecation policy for eliminating features is TBD, but once we reach our 1.0 milestone, there will be a specific policy.
What constitutes a compatible change and how to change the API are detailed by the [API change document](https://github.com/kubernetes/kubernetes/tree/{{page.githubbranch}}/docs/devel/api_changes.md).
## OpenAPI and Swagger definitions
Complete API details are documented using [Swagger v1.2](http://swagger.io/) and [OpenAPI](https://www.openapis.org/). The Kubernetes apiserver (aka "master") exposes an API that can be used to retrieve the Swagger v1.2 Kubernetes API spec located at `/swaggerapi`. You can also enable a UI to browse the API documentation at `/swagger-ui` by passing the `--enable-swagger-ui=true` flag to apiserver.
We also host a version of the [latest v1.2 API documentation UI](http://kubernetes.io/kubernetes/third_party/swagger-ui/). This is updated with the latest release, so if you are using a different version of Kubernetes you will want to use the spec from your apiserver.
Starting with kubernetes 1.4, OpenAPI spec is also available at `/swagger.json`. While we are transitioning from Swagger v1.2 to OpenAPI (aka Swagger v2.0), some of the tools such as kubectl and swagger-ui are still using v1.2 spec. OpenAPI spec is in Beta as of Kubernetes 1.5.
Kubernetes implements an alternative Protobuf based serialization format for the API that is primarily intended for intra-cluster communication, documented in the [design proposal](https://github.com/kubernetes/kubernetes/blob/{{ page.githubbranch }}/docs/proposals/protobuf.md) and the IDL files for each schema are located in the Go packages that define the API objects.
## API versioning
To make it easier to eliminate fields or restructure resource representations, Kubernetes supports
multiple API versions, each at a different API path, such as `/api/v1` or
`/apis/extensions/v1beta1`.
We chose to version at the API level rather than at the resource or field level to ensure that the API presents a clear, consistent view of system resources and behavior, and to enable controlling access to end-of-lifed and/or experimental APIs. The JSON and Protobuf serialization schemas follow the same guidelines for schema changes - all descriptions below cover both formats.
Note that API versioning and Software versioning are only indirectly related. The [API and release
versioning proposal](https://github.com/kubernetes/kubernetes/blob/{{page.githubbranch}}/docs/design/versioning.md) describes the relationship between API versioning and
software versioning.
Different API versions imply different levels of stability and support. The criteria for each level are described
in more detail in the [API Changes documentation](https://github.com/kubernetes/kubernetes/tree/{{page.githubbranch}}/docs/devel/api_changes.md#alpha-beta-and-stable-versions). They are summarized here:
- Alpha level:
- The version names contain `alpha` (e.g. `v1alpha1`).
- May be buggy. Enabling the feature may expose bugs. Disabled by default.
- Support for feature may be dropped at any time without notice.
- The API may change in incompatible ways in a later software release without notice.
- Recommended for use only in short-lived testing clusters, due to increased risk of bugs and lack of long-term support.
- Beta level:
- The version names contain `beta` (e.g. `v2beta3`).
- Code is well tested. Enabling the feature is considered safe. Enabled by default.
- Support for the overall feature will not be dropped, though details may change.
- The schema and/or semantics of objects may change in incompatible ways in a subsequent beta or stable release. When this happens,
we will provide instructions for migrating to the next version. This may require deleting, editing, and re-creating
API objects. The editing process may require some thought. This may require downtime for applications that rely on the feature.
- Recommended for only non-business-critical uses because of potential for incompatible changes in subsequent releases. If you have
multiple clusters which can be upgraded independently, you may be able to relax this restriction.
- **Please do try our beta features and give feedback on them! Once they exit beta, it may not be practical for us to make more changes.**
- Stable level:
- The version name is `vX` where `X` is an integer.
- Stable versions of features will appear in released software for many subsequent versions.
## API groups
To make it easier to extend the Kubernetes API, we implemented [*API groups*](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/api-group.md).
The API group is specified in a REST path and in the `apiVersion` field of a serialized object.
Currently there are several API groups in use:
1. the "core" (oftentimes called "legacy", due to not having explicit group name) group, which is at
REST path `/api/v1` and is not specified as part of the `apiVersion` field, e.g. `apiVersion: v1`.
1. the named groups are at REST path `/apis/$GROUP_NAME/$VERSION`, and use `apiVersion: $GROUP_NAME/$VERSION`
(e.g. `apiVersion: batch/v1`). Full list of supported API groups can be seen in [Kubernetes API reference](/docs/reference/).
There are two supported paths to extending the API.
1. [Third Party Resources](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/extending-api.md)
are for users with very basic CRUD needs.
1. Coming soon: users needing the full set of Kubernetes API semantics can implement their own apiserver
and use the [aggregator](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/aggregated-api-servers.md)
to make it seamless for clients.
## Enabling API groups
Certain resources and API groups are enabled by default. They can be enabled or disabled by setting `--runtime-config`
on the apiserver. `--runtime-config` accepts comma-separated values. For example, to disable batch/v1, set
`--runtime-config=batch/v1=false`; to enable batch/v2alpha1, set `--runtime-config=batch/v2alpha1`.
The flag accepts a comma-separated set of key=value pairs describing the runtime configuration of the apiserver.
IMPORTANT: Enabling or disabling groups or resources requires restarting the apiserver and controller-manager
to pick up the `--runtime-config` changes.
## Enabling resources in the groups
DaemonSets, Deployments, HorizontalPodAutoscalers, Ingress, Jobs and ReplicaSets are enabled by default.
Other resources in the `extensions` group can be enabled by setting `--runtime-config` on the
apiserver. `--runtime-config` accepts comma-separated values. For example, to disable deployments and ingress, set
`--runtime-config=extensions/v1beta1/deployments=false,extensions/v1beta1/ingress=false`
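Putting the two together, an apiserver invocation might look like the following sketch:

```shell
# Sketch only: append your cluster's other flags in place of '...'
$ kube-apiserver --runtime-config=batch/v2alpha1=true,extensions/v1beta1/deployments=false,extensions/v1beta1/ingress=false ...
```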
{% include user-guide-content-moved.md %}

View File

@ -0,0 +1,67 @@
---
assignees:
- soltysh
- sttts
title: Auditing
---
* TOC
{:toc}
Kubernetes Audit provides a security-relevant chronological set of records documenting
the sequence of activities that have affected the system, whether initiated by individual users, administrators,
or other components of the system. It allows the cluster administrator to
answer the following questions:
- what happened?
- when did it happen?
- who initiated it?
- on what did it happen?
- where was it observed?
- from where was it initiated?
- to where was it going?
NOTE: Currently, Kubernetes provides only basic audit capabilities; there is still a lot
of work going on to provide fully featured auditing capabilities (see [this issue](https://github.com/kubernetes/features/issues/22)).
Kubernetes audit is part of the [kube-apiserver](/docs/admin/kube-apiserver), which logs all requests
coming to the server. Each audited request produces two log entries:
1. The request line containing:
- a unique id allowing the response line to be matched (see 2)
- source ip of the request
- HTTP method being invoked
- original user invoking the operation
- impersonated user for the operation
- namespace of the request or `<none>`
- URI as requested
2. The response line containing:
- the unique id from 1
- response code
Example output for user `admin` asking for a list of pods:
```
2016-09-07T13:03:57.400333046Z AUDIT: id="5c3b8227-4af9-4322-8a71-542231c3887b" ip="127.0.0.1" method="GET" user="admin" as="<self>" namespace="default" uri="/api/v1/namespaces/default/pods"
2016-09-07T13:03:57.400710987Z AUDIT: id="5c3b8227-4af9-4322-8a71-542231c3887b" response="200"
```
NOTE: The audit capabilities are available *only* for the secured endpoint of the API server.
## Configuration
[Kube-apiserver](/docs/admin/kube-apiserver) provides the following options, which are responsible
for configuring where and how audit logs are handled:
- `audit-log-path` - enables the audit log, pointing to a file to which the requests are logged.
- `audit-log-maxage` - specifies the maximum number of days to retain old audit log files based on the timestamp encoded in their filename.
- `audit-log-maxbackup` - specifies the maximum number of old audit log files to retain.
- `audit-log-maxsize` - specifies the maximum size in megabytes of the audit log file before it gets rotated. Defaults to 100MB.
If an audit log file already exists, Kubernetes appends new audit logs to that file.
Otherwise, Kubernetes creates an audit log file at the location you specified in
`audit-log-path`. If the audit log file exceeds the size you specify in `audit-log-maxsize`,
Kubernetes will rename the current log file by appending the current timestamp on
the file name (before the file extension) and create a new audit log file.
Kubernetes may delete old log files when creating a new log file; you can configure
how many files are retained and how old they can be by specifying the `audit-log-maxbackup`
and `audit-log-maxage` options.
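A sketch of an apiserver invocation combining these options; the path and retention values are only placeholders:

```shell
# Sketch only: append your cluster's other flags in place of '...'
$ kube-apiserver --audit-log-path=/var/log/kube-apiserver/audit.log \
    --audit-log-maxage=30 \
    --audit-log-maxbackup=10 \
    --audit-log-maxsize=100 ...
```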

View File

@ -0,0 +1,137 @@
---
title: Federation
---
This guide explains why and how to manage multiple Kubernetes clusters using
federation.
* TOC
{:toc}
## Why federation
Federation makes it easy to manage multiple clusters. It does so by providing two
major building blocks:
* Sync resources across clusters: Federation provides the ability to keep
resources in multiple clusters in sync. This can be used, for example, to
ensure that the same deployment exists in multiple clusters.
* Cross cluster discovery: It provides the ability to auto-configure DNS
servers and load balancers with backends from all clusters. This can be used,
for example, to ensure that a global VIP or DNS record can be used to access
backends from multiple clusters.
Some other use cases that federation enables are:
* High Availability: By spreading load across clusters and auto configuring DNS
servers and load balancers, federation minimises the impact of cluster
failure.
* Avoiding provider lock-in: By making it easier to migrate applications across
clusters, federation prevents cluster provider lock-in.
Federation is not helpful unless you have multiple clusters. Some of the reasons
why you might want multiple clusters are:
* Low latency: Having clusters in multiple regions minimises latency by serving
users from the cluster that is closest to them.
* Fault isolation: It might be better to have multiple small clusters rather
than a single large cluster for fault isolation (for example: multiple
clusters in different availability zones of a cloud provider).
[Multi cluster guide](/docs/admin/multi-cluster) has more details on this.
* Scalability: There are scalability limits to a single Kubernetes cluster (most
users will not hit these limits; for more details, see
[Kubernetes Scaling and Performance Goals](https://github.com/kubernetes/community/blob/master/sig-scalability/goals.md)).
* Hybrid cloud: You can have multiple clusters on different cloud providers or
on-premises data centers.
### Caveats
While there are a lot of attractive use cases for federation, there are also
some caveats.
* Increased network bandwidth and cost: The federation control plane watches all
clusters to ensure that the current state is as expected. This can lead to
significant network cost if the clusters are running in different regions on
a cloud provider or on different cloud providers.
* Reduced cross cluster isolation: A bug in the federation control plane can
impact all clusters. This is mitigated by keeping the logic in the federation
control plane to a minimum. It mostly delegates to the control planes of the
underlying Kubernetes clusters whenever it can. The design and implementation also
err on the side of safety and avoiding multi-cluster outages.
* Maturity: The federation project is relatively new and is not very mature.
Not all resources are available and many are still alpha. [Issue
38893](https://github.com/kubernetes/kubernetes/issues/38893) enumerates
known issues with the system that the team is busy solving.
## Setup
To federate multiple clusters, you first need to set up a federation
control plane.
Follow the [setup guide](/docs/admin/federation/) to set up the
federation control plane.
## Hybrid cloud capabilities
Federations of Kubernetes clusters can include clusters running in
different cloud providers (e.g. Google Cloud, AWS), and on-premises
(e.g. on OpenStack). Simply create all of the clusters that you
require, in the appropriate cloud providers and/or locations, and
register each cluster's API endpoint and credentials with your
Federation API Server (See the
[federation admin guide](/docs/admin/federation/) for details).
Thereafter, your API resources can span different clusters
and cloud providers.
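As a sketch, if you use the `kubefed` tool from the federation admin guide, registering an existing cluster might look like this (the cluster name and kubeconfig contexts below are placeholders):
```shell
# "gce-us-central1" is a hypothetical cluster name; "host-cluster" and
# "federation" are hypothetical kubeconfig contexts for the host cluster
# and the federation control plane respectively.
kubefed join gce-us-central1 \
    --host-cluster-context=host-cluster \
    --context=federation
```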
## API resources
Once the control plane is set up, you can start creating federation API
resources.
The following guides explain some of the resources in detail:
* [ConfigMap](https://kubernetes.io/docs/user-guide/federation/configmap/)
* [DaemonSets](https://kubernetes.io/docs/user-guide/federation/daemonsets/)
* [Deployment](https://kubernetes.io/docs/user-guide/federation/deployment/)
* [Events](https://kubernetes.io/docs/user-guide/federation/events/)
* [Ingress](https://kubernetes.io/docs/user-guide/federation/federated-ingress/)
* [Namespaces](https://kubernetes.io/docs/user-guide/federation/namespaces/)
* [ReplicaSets](https://kubernetes.io/docs/user-guide/federation/replicasets/)
* [Secrets](https://kubernetes.io/docs/user-guide/federation/secrets/)
* [Services](https://kubernetes.io/docs/user-guide/federation/federated-services/)
The [API reference docs](/docs/federation/api-reference/) list all the
resources supported by the federation apiserver.
## Cascading deletion
Kubernetes version 1.5 includes support for cascading deletion of federated
resources. With cascading deletion, when you delete a resource from the
federation control plane, the corresponding resources in all underlying clusters
are also deleted.
To enable cascading deletion, set the option
`DeleteOptions.orphanDependents=false` when you delete a resource from the
federation control plane.
The following Federated resources are affected by cascading deletion:
* [Ingress](https://kubernetes.io/docs/user-guide/federation/federated-ingress/)
* [Namespaces](https://kubernetes.io/docs/user-guide/federation/namespaces/)
* [ReplicaSets](https://kubernetes.io/docs/user-guide/federation/replicasets/)
* [Secrets](https://kubernetes.io/docs/user-guide/federation/secrets/)
* [Deployment](https://kubernetes.io/docs/user-guide/federation/deployment/)
* [DaemonSets](https://kubernetes.io/docs/user-guide/federation/daemonsets/)
Note: By default, deleting a resource from the federation control plane does not
delete the corresponding resources from the underlying clusters.
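As a sketch, one way to request cascading deletion is to send a `DeleteOptions` body with the DELETE call against the federation API server (the endpoint and resource name below are placeholders):
```shell
# Hypothetical federation API server endpoint and ReplicaSet name.
curl -X DELETE \
  -H "Content-Type: application/json" \
  -d '{"kind": "DeleteOptions", "apiVersion": "v1", "orphanDependents": false}' \
  https://federation-apiserver.example.com/apis/extensions/v1beta1/namespaces/default/replicasets/my-replicaset
```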
## For more information
* [Federation
proposal](https://github.com/kubernetes/community/blob/{{page.githubbranch}}/contributors/design-proposals/federation.md)
* [Kubecon2016 talk on federation](https://www.youtube.com/watch?v=pq9lbkmxpS8)

View File

@ -0,0 +1,57 @@
---
assignees:
- davidopp
- filipg
- piosz
title: Guaranteed Scheduling For Critical Add-On Pods
---
* TOC
{:toc}
## Overview
In addition to the Kubernetes core components, such as the api-server, scheduler, and controller-manager running on a master machine,
there are a number of add-ons which, for various reasons, must run on a regular cluster node (rather than the Kubernetes master).
Some of these add-ons are critical to a fully functional cluster, such as Heapster, DNS, and UI.
A cluster may stop working properly if a critical add-on is evicted (either manually or as a side effect of another operation like an upgrade)
and becomes pending (for example, when the cluster is highly utilized and either other pending pods get scheduled into the space
vacated by the evicted critical add-on pod, or the amount of resources available on the node has changed for some other reason).
## Rescheduler: guaranteed scheduling of critical add-ons
The rescheduler ensures that critical add-ons are always scheduled
(assuming the cluster has enough resources to run the critical add-on pods in the absence of regular pods).
If the scheduler determines that no node has enough free resources to run the critical add-on pod,
given the pods that are already running in the cluster
(indicated by the critical add-on pod's `PodScheduled` condition being set to false with the reason `Unschedulable`),
the rescheduler tries to free up space for the add-on by evicting some pods; then the scheduler schedules the add-on pod.
To avoid a situation where another pod is scheduled into the space prepared for the critical add-on,
the chosen node gets a temporary taint "CriticalAddonsOnly" before the eviction(s)
(see [more details](https://github.com/kubernetes/kubernetes/blob/master/docs/design/taint-toleration-dedicated.md)).
Each critical add-on has to tolerate this taint,
while other pods shouldn't tolerate it. The taint is removed once the add-on is successfully scheduled.
*Warning:* Currently there is no guarantee which node is chosen and which pods are killed
in order to schedule critical pods, so if the rescheduler is enabled your pods might occasionally be
killed for this purpose.
## Config
The rescheduler doesn't have any user-facing configuration (component config) or API.
It's enabled by default. It can be disabled:
* during cluster setup, by setting the `ENABLE_RESCHEDULER` flag to `false`
* on a running cluster, by deleting its manifest from the master node
(default path `/etc/kubernetes/manifests/rescheduler.manifest`), as sketched below
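For example, on a running cluster you might disable the rescheduler like this (a sketch; the path is the default mentioned above and may differ in your setup):
```shell
# Run on the master node. Removing the manifest stops the rescheduler pod.
sudo rm /etc/kubernetes/manifests/rescheduler.manifest
```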
### Marking add-on as critical
To be considered critical, an add-on has to run in the `kube-system` namespace (configurable via a flag)
and have the following annotations specified:
* `scheduler.alpha.kubernetes.io/critical-pod` set to the empty string
* `scheduler.alpha.kubernetes.io/tolerations` set to `[{"key":"CriticalAddonsOnly", "operator":"Exists"}]`
The first annotation marks a pod as critical. The second one is required by the rescheduler algorithm.
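As a sketch, a critical add-on pod could be created with those annotations as follows (the pod name and image are placeholders, not a real add-on):
```shell
# Hypothetical critical add-on pod; only the annotations are significant here.
cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: my-critical-addon
  namespace: kube-system
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""
    scheduler.alpha.kubernetes.io/tolerations: '[{"key":"CriticalAddonsOnly", "operator":"Exists"}]'
spec:
  containers:
  - name: my-critical-addon
    image: gcr.io/google_containers/pause-amd64:3.0
EOF
```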

View File

@ -3,6 +3,9 @@ assignees:
- crassirostris
- piosz
title: Logging and Monitoring Cluster Activity
redirect_from:
- "/docs/concepts/clusters/logging/"
- "/docs/concepts/clusters/logging.html"
---
Application and systems logs can help you understand what is happening inside your cluster. The logs are particularly useful for debugging problems and monitoring cluster activity. Most modern applications have some kind of logging mechanism; as such, most container engines are likewise designed to support some kind of logging. The easiest and most embraced logging method for containerized applications is to write to the standard output and standard error streams.
@ -21,7 +24,7 @@ The guidance for cluster-level logging assumes that a logging backend is present
In this section, you can see an example of basic logging in Kubernetes that
outputs data to the standard output stream. This demonstration uses
a [pod specification](/docs/concepts/clusters/counter-pod.yaml) with
a [pod specification](/docs/concepts/cluster-administration/counter-pod.yaml) with
a container that writes some text to standard output once per second.
{% include code.html language="yaml" file="counter-pod.yaml" ghlink="/docs/tasks/debug-application-cluster/counter-pod.yaml" %}
@ -131,7 +134,7 @@ Consider the following example. A pod runs a single container, and the container
writes to two different log files, using two different formats. Here's a
configuration file for the Pod:
{% include code.html language="yaml" file="two-files-counter-pod.yaml" ghlink="/docs/concepts/clusters/two-files-counter-pod.yaml" %}
{% include code.html language="yaml" file="two-files-counter-pod.yaml" ghlink="/docs/concepts/cluster-administration/two-files-counter-pod.yaml" %}
It would be a mess to have log entries of different formats in the same log
stream, even if you managed to redirect both components to the `stdout` stream of
@ -141,7 +144,7 @@ the logs to its own `stdout` stream.
Here's a configuration file for a pod that has two sidecar containers:
{% include code.html language="yaml" file="two-files-counter-pod-streaming-sidecar.yaml" ghlink="/docs/concepts/clusters/two-files-counter-pod-streaming-sidecar.yaml" %}
{% include code.html language="yaml" file="two-files-counter-pod-streaming-sidecar.yaml" ghlink="/docs/concepts/cluster-administration/two-files-counter-pod-streaming-sidecar.yaml" %}
Now when you run this pod, you can access each log stream separately by
running the following commands:
@ -197,7 +200,7 @@ which uses fluentd as a logging agent. Here are two configuration files that
you can use to implement this approach. The first file contains
a [ConfigMap](/docs/user-guide/configmap/) to configure fluentd.
{% include code.html language="yaml" file="fluentd-sidecar-config.yaml" ghlink="/docs/concepts/clusters/fluentd-sidecar-config.yaml" %}
{% include code.html language="yaml" file="fluentd-sidecar-config.yaml" ghlink="/docs/concepts/cluster-administration/fluentd-sidecar-config.yaml" %}
**Note**: The configuration of fluentd is beyond the scope of this article. For
information about configuring fluentd, see the
@ -206,7 +209,7 @@ information about configuring fluentd, see the
The second file describes a pod that has a sidecar container running fluentd.
The pod mounts a volume where fluentd can pick up its configuration data.
{% include code.html language="yaml" file="two-files-counter-pod-agent-sidecar.yaml" ghlink="/docs/concepts/clusters/two-files-counter-pod-agent-sidecar.yaml" %}
{% include code.html language="yaml" file="two-files-counter-pod-agent-sidecar.yaml" ghlink="/docs/concepts/cluster-administration/two-files-counter-pod-agent-sidecar.yaml" %}
After some time you can find log messages in the Stackdriver interface.

View File

@ -0,0 +1,438 @@
---
assignees:
- bgrant0607
- janetkuo
- mikedanese
title: Managing Resources
---
You've deployed your application and exposed it via a service. Now what? Kubernetes provides a number of tools to help you manage your application deployment, including scaling and updating. Among the features we'll discuss in more depth are [configuration files](/docs/user-guide/configuring-containers/#configuration-in-kubernetes) and [labels](/docs/user-guide/deploying-applications/#labels).
You can find all the files for this example [in our docs
repo here](https://github.com/kubernetes/kubernetes.github.io/tree/{{page.docsbranch}}/docs/user-guide/).
* TOC
{:toc}
## Organizing resource configurations
Many applications require multiple resources to be created, such as a Deployment and a Service. Management of multiple resources can be simplified by grouping them together in the same file (separated by `---` in YAML). For example:
{% include code.html language="yaml" file="nginx-app.yaml" ghlink="/docs/user-guide/nginx-app.yaml" %}
Multiple resources can be created the same way as a single resource:
```shell
$ kubectl create -f docs/user-guide/nginx-app.yaml
service "my-nginx-svc" created
deployment "my-nginx" created
```
The resources will be created in the order they appear in the file. Therefore, it's best to specify the service first, since that will ensure the scheduler can spread the pods associated with the service as they are created by the controller(s), such as Deployment.
`kubectl create` also accepts multiple `-f` arguments:
```shell
$ kubectl create -f docs/user-guide/nginx/nginx-svc.yaml -f docs/user-guide/nginx/nginx-deployment.yaml
```
And a directory can be specified rather than or in addition to individual files:
```shell
$ kubectl create -f docs/user-guide/nginx/
```
`kubectl` will read any files with suffixes `.yaml`, `.yml`, or `.json`.
It is a recommended practice to put resources related to the same microservice or application tier into the same file, and to group all of the files associated with your application in the same directory. If the tiers of your application bind to each other using DNS, you can simply deploy all of the components of your stack en masse.
A URL can also be specified as a configuration source, which is handy for deploying directly from configuration files checked into GitHub:
```shell
$ kubectl create -f https://raw.githubusercontent.com/kubernetes/kubernetes/master/docs/user-guide/nginx-deployment.yaml
deployment "nginx-deployment" created
```
## Bulk operations in kubectl
Resource creation isn't the only operation that `kubectl` can perform in bulk. It can also extract resource names from configuration files in order to perform other operations, in particular to delete the same resources you created:
```shell
$ kubectl delete -f docs/user-guide/nginx/
deployment "my-nginx" deleted
service "my-nginx-svc" deleted
```
In the case of just two resources, it's also easy to specify both on the command line using the resource/name syntax:
```shell
$ kubectl delete deployments/my-nginx services/my-nginx-svc
```
For larger numbers of resources, you'll find it easier to specify the selector (label query) specified using `-l` or `--selector`, to filter resources by their labels:
```shell
$ kubectl delete deployment,services -l app=nginx
deployment "my-nginx" deleted
service "my-nginx-svc" deleted
```
Because `kubectl` outputs resource names in the same syntax it accepts, it's easy to chain operations using `$()` or `xargs`:
```shell
$ kubectl get $(kubectl create -f docs/user-guide/nginx/ -o name | grep service)
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
my-nginx-svc 10.0.0.208 80/TCP 0s
```
With the above commands, we first create resources under docs/user-guide/nginx/ and print the resources created with the `-o name` output format
(printing each resource as resource/name). Then we `grep` only the "service" lines and display them with `kubectl get`.
If you happen to organize your resources across several subdirectories within a particular directory, you can recursively perform the operations on the subdirectories also, by specifying `--recursive` or `-R` alongside the `--filename,-f` flag.
For instance, assume there is a directory `project/k8s/development` that holds all of the manifests needed for the development environment, organized by resource type:
```
project/k8s/development
├── configmap
│   └── my-configmap.yaml
├── deployment
│   └── my-deployment.yaml
└── pvc
└── my-pvc.yaml
```
By default, performing a bulk operation on `project/k8s/development` will stop at the first level of the directory, not processing any subdirectories. If we tried to create the resources in this directory using the following command, we'd encounter an error:
```shell
$ kubectl create -f project/k8s/development
error: you must provide one or more resources by argument or filename (.json|.yaml|.yml|stdin)
```
Instead, specify the `--recursive` or `-R` flag with the `--filename,-f` flag as such:
```shell
$ kubectl create -f project/k8s/development --recursive
configmap "my-config" created
deployment "my-deployment" created
persistentvolumeclaim "my-pvc" created
```
The `--recursive` flag works with any operation that accepts the `--filename,-f` flag such as: `kubectl {create,get,delete,describe,rollout} etc.`
The `--recursive` flag also works when multiple `-f` arguments are provided:
```shell
$ kubectl create -f project/k8s/namespaces -f project/k8s/development --recursive
namespace "development" created
namespace "staging" created
configmap "my-config" created
deployment "my-deployment" created
persistentvolumeclaim "my-pvc" created
```
If you're interested in learning more about `kubectl`, go ahead and read [kubectl Overview](/docs/user-guide/kubectl-overview).
## Using labels effectively
The examples we've used so far apply at most a single label to any resource. There are many scenarios where multiple labels should be used to distinguish sets from one another.
For instance, different applications would use different values for the `app` label, but a multi-tier application, such as the [guestbook example](https://github.com/kubernetes/kubernetes/tree/{{page.githubbranch}}/examples/guestbook/), would additionally need to distinguish each tier. The frontend could carry the following labels:
```yaml
labels:
app: guestbook
tier: frontend
```
while the Redis master and slave would have different `tier` labels, and perhaps even an additional `role` label:
```yaml
labels:
app: guestbook
tier: backend
role: master
```
and
```yaml
labels:
app: guestbook
tier: backend
role: slave
```
The labels allow us to slice and dice our resources along any dimension specified by a label:
```shell
$ kubectl create -f examples/guestbook/all-in-one/guestbook-all-in-one.yaml
$ kubectl get pods -Lapp -Ltier -Lrole
NAME READY STATUS RESTARTS AGE APP TIER ROLE
guestbook-fe-4nlpb 1/1 Running 0 1m guestbook frontend <none>
guestbook-fe-ght6d 1/1 Running 0 1m guestbook frontend <none>
guestbook-fe-jpy62 1/1 Running 0 1m guestbook frontend <none>
guestbook-redis-master-5pg3b 1/1 Running 0 1m guestbook backend master
guestbook-redis-slave-2q2yf 1/1 Running 0 1m guestbook backend slave
guestbook-redis-slave-qgazl 1/1 Running 0 1m guestbook backend slave
my-nginx-divi2 1/1 Running 0 29m nginx <none> <none>
my-nginx-o0ef1 1/1 Running 0 29m nginx <none> <none>
$ kubectl get pods -lapp=guestbook,role=slave
NAME READY STATUS RESTARTS AGE
guestbook-redis-slave-2q2yf 1/1 Running 0 3m
guestbook-redis-slave-qgazl 1/1 Running 0 3m
```
## Canary deployments
Another scenario where multiple labels are needed is to distinguish deployments of different releases or configurations of the same component. It is common practice to deploy a *canary* of a new application release (specified via image tag in the pod template) side by side with the previous release so that the new release can receive live production traffic before fully rolling it out.
For instance, you can use a `track` label to differentiate different releases.
The primary, stable release would have a `track` label with value as `stable`:
```yaml
name: frontend
replicas: 3
...
labels:
app: guestbook
tier: frontend
track: stable
...
image: gb-frontend:v3
```
and then you can create a new release of the guestbook frontend that carries the `track` label with a different value (i.e. `canary`), so that the two sets of pods do not overlap:
```yaml
name: frontend-canary
replicas: 1
...
labels:
app: guestbook
tier: frontend
track: canary
...
image: gb-frontend:v4
```
The frontend service would span both sets of replicas by selecting the common subset of their labels (i.e. omitting the `track` label), so that the traffic will be redirected to both applications:
```yaml
selector:
app: guestbook
tier: frontend
```
You can tweak the number of replicas of the stable and canary releases to determine the ratio of each release that will receive live production traffic (in this case, 3:1).
Once you're confident, you can update the stable track to the new application release and remove the canary one.
For a more concrete example, check the [tutorial of deploying Ghost](https://github.com/kelseyhightower/talks/tree/master/kubecon-eu-2016/demo#deploy-a-canary).
## Updating labels
Sometimes existing pods and other resources need to be relabeled before creating new resources. This can be done with `kubectl label`.
For example, if you want to label all your nginx pods as frontend tier, simply run:
```shell
$ kubectl label pods -l app=nginx tier=fe
pod "my-nginx-2035384211-j5fhi" labeled
pod "my-nginx-2035384211-u2c7e" labeled
pod "my-nginx-2035384211-u3t6x" labeled
```
This first filters all pods with the label "app=nginx", and then labels them with "tier=fe".
To see the pods you just labeled, run:
```shell
$ kubectl get pods -l app=nginx -L tier
NAME READY STATUS RESTARTS AGE TIER
my-nginx-2035384211-j5fhi 1/1 Running 0 23m fe
my-nginx-2035384211-u2c7e 1/1 Running 0 23m fe
my-nginx-2035384211-u3t6x 1/1 Running 0 23m fe
```
This outputs all "app=nginx" pods, with an additional label column of pods' tier (specified with `-L` or `--label-columns`).
For more information, please see [labels](/docs/user-guide/labels/) and [kubectl label](/docs/user-guide/kubectl/kubectl_label/) document.
## Updating annotations
Sometimes you would want to attach annotations to resources. Annotations are arbitrary non-identifying metadata for retrieval by API clients such as tools, libraries, etc. This can be done with `kubectl annotate`. For example:
```shell
$ kubectl annotate pods my-nginx-v4-9gw19 description='my frontend running nginx'
$ kubectl get pods my-nginx-v4-9gw19 -o yaml
apiVersion: v1
kind: Pod
metadata:
annotations:
description: my frontend running nginx
...
```
For more information, please see [annotations](/docs/user-guide/annotations/) and [kubectl annotate](/docs/user-guide/kubectl/kubectl_annotate/) document.
## Scaling your application
When load on your application grows or shrinks, it's easy to scale with `kubectl`. For instance, to decrease the number of nginx replicas from 3 to 1, do:
```shell
$ kubectl scale deployment/my-nginx --replicas=1
deployment "my-nginx" scaled
```
Now you only have one pod managed by the deployment.
```shell
$ kubectl get pods -l app=nginx
NAME READY STATUS RESTARTS AGE
my-nginx-2035384211-j5fhi 1/1 Running 0 30m
```
To have the system automatically choose the number of nginx replicas as needed, ranging from 1 to 3, do:
```shell
$ kubectl autoscale deployment/my-nginx --min=1 --max=3
deployment "my-nginx" autoscaled
```
Now your nginx replicas will be scaled up and down as needed, automatically.
For more information, please see [kubectl scale](/docs/user-guide/kubectl/kubectl_scale/), [kubectl autoscale](/docs/user-guide/kubectl/kubectl_autoscale/) and [horizontal pod autoscaler](/docs/user-guide/horizontal-pod-autoscaler/) document.
## In-place updates of resources
Sometimes it's necessary to make narrow, non-disruptive updates to resources you've created.
### kubectl apply
It is suggested to maintain a set of configuration files in source control (see [configuration as code](http://martinfowler.com/bliki/InfrastructureAsCode.html)),
so that they can be maintained and versioned along with the code for the resources they configure.
Then, you can use [`kubectl apply`](/docs/user-guide/kubectl/kubectl_apply/) to push your configuration changes to the cluster.
This command will compare the version of the configuration that you're pushing with the previous version and apply the changes you've made, without overwriting any automated changes to properties you haven't specified.
```shell
$ kubectl apply -f docs/user-guide/nginx/nginx-deployment.yaml
deployment "my-nginx" configured
```
Note that `kubectl apply` attaches an annotation to the resource in order to determine the changes to the configuration since the previous invocation. When it's invoked, `kubectl apply` does a three-way diff between the previous configuration, the provided input and the current configuration of the resource, in order to determine how to modify the resource.
Currently, resources are created without this annotation, so the first invocation of `kubectl apply` will fall back to a two-way diff between the provided input and the current configuration of the resource. During this first invocation, it cannot detect the deletion of properties set when the resource was created. For this reason, it will not remove them.
All subsequent calls to `kubectl apply`, and other commands that modify the configuration, such as `kubectl replace` and `kubectl edit`, will update the annotation, allowing subsequent calls to `kubectl apply` to detect and perform deletions using a three-way diff.
**Note:** To use apply, always create the resource initially with either `kubectl apply` or `kubectl create --save-config`.
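For example, a sketch of this workflow using the same deployment file as above:
```shell
# Record the applied configuration in an annotation at creation time,
# so later invocations of 'kubectl apply' can compute three-way diffs.
$ kubectl create -f docs/user-guide/nginx/nginx-deployment.yaml --save-config
deployment "my-nginx" created

# Subsequent configuration changes can then be pushed with apply.
$ kubectl apply -f docs/user-guide/nginx/nginx-deployment.yaml
deployment "my-nginx" configured
```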
### kubectl edit
Alternatively, you may also update resources with `kubectl edit`:
```shell
$ kubectl edit deployment/my-nginx
```
This is equivalent to first `get`ting the resource, editing it in a text editor, and then `apply`ing the resource with the updated version:
```shell
$ kubectl get deployment my-nginx -o yaml > /tmp/nginx.yaml
$ vi /tmp/nginx.yaml
# do some edit, and then save the file
$ kubectl apply -f /tmp/nginx.yaml
deployment "my-nginx" configured
$ rm /tmp/nginx.yaml
```
This allows you to do more significant changes more easily. Note that you can specify the editor with your `EDITOR` or `KUBE_EDITOR` environment variables.
For more information, please see [kubectl edit](/docs/user-guide/kubectl/kubectl_edit/) document.
### kubectl patch
Suppose you want to fix a typo of the container's image of a Deployment. One way to do that is with `kubectl patch`:
```shell
# Suppose you have a Deployment with a container named "nginx" and its image "nignx" (typo),
# use container name "nginx" as a key to update the image from "nignx" (typo) to "nginx"
$ kubectl get deployment my-nginx -o yaml
```
```yaml
apiVersion: extensions/v1beta1
kind: Deployment
...
spec:
template:
spec:
containers:
- image: nignx
name: nginx
...
```
```shell
$ kubectl patch deployment my-nginx -p'{"spec":{"template":{"spec":{"containers":[{"name":"nginx","image":"nginx"}]}}}}'
"my-nginx" patched
$ kubectl get pod my-nginx-1jgkf -o yaml
```
```yaml
apiVersion: extensions/v1beta1
kind: Deployment
...
spec:
template:
spec:
containers:
- image: nginx
name: nginx
...
```
The patch is specified using JSON.
The system ensures that you don't clobber changes made by other users or components by confirming that the `resourceVersion` doesn't differ from the version you edited. If you want to update regardless of other changes, remove the `resourceVersion` field when you edit the resource. However, if you do this, don't use your original configuration file as the source since additional fields most likely were set in the live state.
For more information, please see [kubectl patch](/docs/user-guide/kubectl/kubectl_patch/) document.
## Disruptive updates
In some cases, you may need to update resource fields that cannot be updated once initialized, or you may just want to make a recursive change immediately, such as to fix broken pods created by a Deployment. To change such fields, use `replace --force`, which deletes and re-creates the resource. In this case, you can simply modify your original configuration file:
```shell
$ kubectl replace -f docs/user-guide/nginx/nginx-deployment.yaml --force
deployment "my-nginx" deleted
deployment "my-nginx" replaced
```
## Updating your application without a service outage
At some point, you'll eventually need to update your deployed application, typically by specifying a new image or image tag, as in the canary deployment scenario above. `kubectl` supports several update operations, each of which is applicable to different scenarios.
We'll guide you through how to create and update applications with Deployments. If your deployed application is managed by Replication Controllers,
you should read [how to use `kubectl rolling-update`](/docs/tasks/run-application/rolling-update-replication-controller/) instead.
Let's say you were running version 1.7.9 of nginx:
```shell
$ kubectl run my-nginx --image=nginx:1.7.9 --replicas=3
deployment "my-nginx" created
```
To update to version 1.9.1, simply change `.spec.template.spec.containers[0].image` from `nginx:1.7.9` to `nginx:1.9.1`, with the kubectl commands we learned above.
```shell
$ kubectl edit deployment/my-nginx
```
That's it! The Deployment will declaratively update the deployed nginx application progressively behind the scenes. It ensures that only a certain number of old replicas may be down while they are being updated, and only a certain number of new replicas may be created above the desired number of pods. To learn more details about it, visit the [Deployment page](/docs/user-guide/deployments/).
## What's next?
- [Learn about how to use `kubectl` for application introspection and debugging.](/docs/user-guide/introspection-and-debugging/)
- [Configuration Best Practices and Tips](/docs/concepts/configuration/overview/)

View File

@ -0,0 +1,66 @@
---
assignees:
- davidopp
title: Using Multiple Clusters
---
You may want to set up multiple Kubernetes clusters, both to
have clusters in different regions nearer to your users and to tolerate failures and/or invasive maintenance.
This document describes some of the issues to consider when making a decision about doing so.
If you decide to have multiple clusters, Kubernetes provides a way to [federate them](/docs/admin/federation/).
## Scope of a single cluster
On IaaS providers such as Google Compute Engine or Amazon Web Services, a VM exists in a
[zone](https://cloud.google.com/compute/docs/zones) or [availability
zone](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html).
We suggest that all the VMs in a Kubernetes cluster should be in the same availability zone, because:
- compared to having a single global Kubernetes cluster, there are fewer single points of failure
- compared to a cluster that spans availability zones, it is easier to reason about the availability properties of a
single-zone cluster.
- when the Kubernetes developers are designing the system (e.g. making assumptions about latency, bandwidth, or
correlated failures) they are assuming all the machines are in a single data center, or otherwise closely connected.
It is okay to have multiple clusters per availability zone, though on balance we think fewer is better.
Reasons to prefer fewer clusters are:
- improved bin packing of Pods in some cases with more nodes in one cluster (less resource fragmentation)
- reduced operational overhead (though the advantage is diminished as ops tooling and processes mature)
- reduced costs for per-cluster fixed resource costs, e.g. apiserver VMs (but small as a percentage
of overall cluster cost for medium to large clusters).
Reasons to have multiple clusters include:
- strict security policies requiring isolation of one class of work from another (but, see Partitioning Clusters
below).
- test clusters to canary new Kubernetes releases or other cluster software.
## Selecting the right number of clusters
The selection of the number of Kubernetes clusters may be a relatively static choice, only revisited occasionally.
By contrast, the number of nodes in a cluster and the number of pods in a service may change frequently according to
load and growth.
To pick the number of clusters, first, decide which regions you need to be in to have adequate latency to all your end users, for services that will run
on Kubernetes (if you use a Content Distribution Network, the latency requirements for the CDN-hosted content need not
be considered). Legal issues might influence this as well. For example, a company with a global customer base might decide to have clusters in US, EU, AP, and SA regions.
Call the number of regions to be in `R`.
Second, decide how many clusters should be able to be unavailable at the same time, while still being available. Call
the number that can be unavailable `U`. If you are not sure, then 1 is a fine choice.
If it is allowable for load-balancing to direct traffic to any region in the event of a cluster failure, then
you need at least the larger of `R` or `U + 1` clusters. If it is not (e.g. you want to ensure low latency for all
users in the event of a cluster failure), then you need to have `R * (U + 1)` clusters
(`U + 1` in each of `R` regions). In any case, try to put each cluster in a different zone.
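For example, with `R = 3` regions and a tolerance of `U = 1` unavailable cluster, you need at least `max(3, 2) = 3` clusters if cross-region failover is acceptable, but `3 * (1 + 1) = 6` clusters (two per region) if every region must keep serving its users locally through a single cluster failure.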
Finally, if any of your clusters would need more than the maximum recommended number of nodes for a Kubernetes cluster, then
you may need even more clusters. Kubernetes v1.3 supports clusters up to 1000 nodes in size.
## Working with multiple clusters
When you have multiple clusters, you would typically create services with the same config in each cluster and put each of those
service instances behind a load balancer (AWS Elastic Load Balancer, GCE Forwarding Rule or HTTP Load Balancer) spanning all of them, so that
failures of a single cluster are not visible to end users.

View File

@ -0,0 +1,73 @@
---
assignees:
- dcbw
- freehan
- thockin
title: Network Plugins
---
* TOC
{:toc}
__Disclaimer__: Network plugins are in alpha. This document's contents will change rapidly.
Network plugins in Kubernetes come in a few flavors:
* CNI plugins: adhere to the appc/CNI specification, designed for interoperability.
* Kubenet plugin: implements basic `cbr0` using the `bridge` and `host-local` CNI plugins
## Installation
The kubelet has a single default network plugin, and a default network common to the entire cluster. It probes for plugins when it starts up, remembers what it found, and executes the selected plugin at appropriate times in the pod lifecycle (this is only true for docker, as rkt manages its own CNI plugins). There are two Kubelet command line parameters to keep in mind when using plugins:
* `network-plugin-dir`: Kubelet probes this directory for plugins on startup
* `network-plugin`: The network plugin to use from `network-plugin-dir`. It must match the name reported by a plugin probed from the plugin directory. For CNI plugins, this is simply "cni".
## Network Plugin Requirements
Besides providing the [`NetworkPlugin` interface](https://github.com/kubernetes/kubernetes/tree/{{page.version}}/pkg/kubelet/network/plugins.go) to configure and clean up pod networking, the plugin may also need specific support for kube-proxy. The iptables proxy obviously depends on iptables, and the plugin may need to ensure that container traffic is made available to iptables. For example, if the plugin connects containers to a Linux bridge, the plugin must set the `net/bridge/bridge-nf-call-iptables` sysctl to `1` to ensure that the iptables proxy functions correctly. If the plugin does not use a Linux bridge (but instead something like Open vSwitch or some other mechanism) it should ensure container traffic is appropriately routed for the proxy.
By default if no kubelet network plugin is specified, the `noop` plugin is used, which sets `net/bridge/bridge-nf-call-iptables=1` to ensure simple configurations (like docker with a bridge) work correctly with the iptables proxy.
### CNI
The CNI plugin is selected by passing Kubelet the `--network-plugin=cni` command-line option. Kubelet reads a file from `--cni-conf-dir` (default `/etc/cni/net.d`) and uses the CNI configuration from that file to set up each pod's network. The CNI configuration file must match the [CNI specification](https://github.com/containernetworking/cni/blob/master/SPEC.md#network-configuration), and any required CNI plugins referenced by the configuration must be present in `--cni-bin-dir` (default `/opt/cni/bin`).
If there are multiple CNI configuration files in the directory, the first one in lexicographic order of file name is used.
In addition to the CNI plugin specified by the configuration file, Kubernetes requires the standard CNI [`lo`](https://github.com/containernetworking/cni/blob/master/plugins/main/loopback/loopback.go) plugin, at minimum version 0.2.0.
Limitation: Due to [#31307](https://github.com/kubernetes/kubernetes/issues/31307), `HostPort` won't work with the CNI networking plugin at the moment. That means all `hostPort` attributes in pods are simply ignored.
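As a sketch, a minimal CNI configuration file for the standard `bridge` and `host-local` plugins could be placed in the default configuration directory like this (the file name, bridge name, and subnet are illustrative):
```shell
# Hypothetical single-network CNI configuration. Kubelet, started with
# --network-plugin=cni, uses the first file (in lexicographic order)
# found in --cni-conf-dir (default /etc/cni/net.d).
cat <<EOF > /etc/cni/net.d/10-mynet.conf
{
  "name": "mynet",
  "type": "bridge",
  "bridge": "cni0",
  "isGateway": true,
  "ipMasq": true,
  "ipam": {
    "type": "host-local",
    "subnet": "10.22.0.0/16",
    "routes": [{ "dst": "0.0.0.0/0" }]
  }
}
EOF
```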
### kubenet
Kubenet is a very basic, simple network plugin, on Linux only. It does not, of itself, implement more advanced features like cross-node networking or network policy. It is typically used together with a cloud provider that sets up routing rules for communication between nodes, or in single-node environments.
Kubenet creates a Linux bridge named `cbr0` and creates a veth pair for each pod with the host end of each pair connected to `cbr0`. The pod end of the pair is assigned an IP address allocated from a range assigned to the node either through configuration or by the controller-manager. `cbr0` is assigned an MTU matching the smallest MTU of an enabled normal interface on the host.
The plugin requires a few things:
* The standard CNI `bridge`, `lo` and `host-local` plugins are required, at minimum version 0.2.0. Kubenet will first search for them in `/opt/cni/bin`. Specify `network-plugin-dir` to supply an additional search path. The first match found will take effect.
* Kubelet must be run with the `--network-plugin=kubenet` argument to enable the plugin
* Kubelet should also be run with the `--non-masquerade-cidr=<clusterCidr>` argument to ensure traffic to IPs outside this range will use IP masquerade.
* The node must be assigned an IP subnet through either the `--pod-cidr` kubelet command-line option or the `--allocate-node-cidrs=true --cluster-cidr=<cidr>` controller-manager command-line options.
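Putting those requirements together, a kubelet invocation using kubenet might look like the following (a sketch; the CIDR values are placeholders and the remaining kubelet flags are omitted):
```shell
# Illustrative kubenet-related flags only; combine with your usual kubelet flags.
kubelet \
  --network-plugin=kubenet \
  --non-masquerade-cidr=10.0.0.0/8 \
  --pod-cidr=10.180.1.0/24
```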
### Customizing the MTU (with kubenet)
The MTU should always be configured correctly to get the best networking performance. Network plugins will usually try
to infer a sensible MTU, but sometimes the logic will not result in an optimal MTU. For example, if the
Docker bridge or another interface has a small MTU, kubenet will currently select that MTU. Or if you are
using IPSEC encapsulation, the MTU must be reduced, and this calculation is out-of-scope for
most network plugins.
Where needed, you can specify the MTU explicitly with the `network-plugin-mtu` kubelet option. For example,
on AWS the `eth0` MTU is typically 9001, so you might specify `--network-plugin-mtu=9001`. If you're using IPSEC you
might reduce it to allow for encapsulation overhead e.g. `--network-plugin-mtu=8873`.
This option is provided to the network-plugin; currently **only kubenet supports `network-plugin-mtu`**.
## Usage Summary
* `--network-plugin=cni` specifies that we use the `cni` network plugin with actual CNI plugin binaries located in `--cni-bin-dir` (default `/opt/cni/bin`) and CNI plugin configuration located in `--cni-conf-dir` (default `/etc/cni/net.d`).
* `--network-plugin=kubenet` specifies that we use the `kubenet` network plugin with CNI `bridge` and `host-local` plugins placed in `/opt/cni/bin` or `network-plugin-dir`.
* `--network-plugin-mtu=9001` specifies the MTU to use, currently only used by the `kubenet` network plugin.

View File

@ -0,0 +1,215 @@
---
assignees:
- thockin
title: Cluster Networking
---
Kubernetes approaches networking somewhat differently than Docker does by
default. There are 4 distinct networking problems to solve:
1. Highly-coupled container-to-container communications: this is solved by
[pods](/docs/user-guide/pods/) and `localhost` communications.
2. Pod-to-Pod communications: this is the primary focus of this document.
3. Pod-to-Service communications: this is covered by [services](/docs/user-guide/services/).
4. External-to-Service communications: this is covered by [services](/docs/user-guide/services/).
* TOC
{:toc}
## Summary
Kubernetes assumes that pods can communicate with other pods, regardless of
which host they land on. We give every pod its own IP address so you do not
need to explicitly create links between pods and you almost never need to deal
with mapping container ports to host ports. This creates a clean,
backwards-compatible model where pods can be treated much like VMs or physical
hosts from the perspectives of port allocation, naming, service discovery, load
balancing, application configuration, and migration.
To achieve this we must impose some requirements on how you set up your cluster
networking.
## Docker model
Before discussing the Kubernetes approach to networking, it is worthwhile to
review the "normal" way that networking works with Docker. By default, Docker
uses host-private networking. It creates a virtual bridge, called `docker0` by
default, and allocates a subnet from one of the private address blocks defined
in [RFC1918](https://tools.ietf.org/html/rfc1918) for that bridge. For each
container that Docker creates, it allocates a virtual ethernet device (called
`veth`) which is attached to the bridge. The veth is mapped to appear as `eth0`
in the container, using Linux namespaces. The in-container `eth0` interface is
given an IP address from the bridge's address range.
The result is that Docker containers can talk to other containers only if they
are on the same machine (and thus the same virtual bridge). Containers on
different machines can not reach each other - in fact they may end up with the
exact same network ranges and IP addresses.
In order for Docker containers to communicate across nodes, they must be
allocated ports on the machine's own IP address, which are then forwarded or
proxied to the containers. This obviously means that containers must either
coordinate which ports they use very carefully or else be allocated ports
dynamically.
## Kubernetes model
Coordinating ports across multiple developers is very difficult to do at
scale and exposes users to cluster-level issues outside of their control.
Dynamic port allocation brings a lot of complications to the system - every
application has to take ports as flags, the API servers have to know how to
insert dynamic port numbers into configuration blocks, services have to know
how to find each other, etc. Rather than deal with this, Kubernetes takes a
different approach.
Kubernetes imposes the following fundamental requirements on any networking
implementation (barring any intentional network segmentation policies):
* all containers can communicate with all other containers without NAT
* all nodes can communicate with all containers (and vice-versa) without NAT
* the IP that a container sees itself as is the same IP that others see it as
What this means in practice is that you can not just take two computers
running Docker and expect Kubernetes to work. You must ensure that the
fundamental requirements are met.
This model is not only less complex overall, but it is principally compatible
with the desire for Kubernetes to enable low-friction porting of apps from VMs
to containers. If your job previously ran in a VM, your VM had an IP and could
talk to other VMs in your project. This is the same basic model.
Until now this document has talked about containers. In reality, Kubernetes
applies IP addresses at the `Pod` scope - containers within a `Pod` share their
network namespaces - including their IP address. This means that containers
within a `Pod` can all reach each other's ports on `localhost`. This does imply
that containers within a `Pod` must coordinate port usage, but this is no
different than processes in a VM. We call this the "IP-per-pod" model. This
is implemented in Docker as a "pod container" which holds the network namespace
open while "app containers" (the things the user specified) join that namespace
with Docker's `--net=container:<id>` function.
As with Docker, it is possible to request host ports, but this is reduced to a
very niche operation. In this case a port will be allocated on the host `Node`
and traffic will be forwarded to the `Pod`. The `Pod` itself is blind to the
existence or non-existence of host ports.
## How to achieve this
There are a number of ways that this network model can be implemented. This
document is not an exhaustive study of the various methods, but hopefully serves
as an introduction to various technologies and serves as a jumping-off point.
The following networking options are sorted alphabetically - the order does not
imply any preferential status.
### Contiv
[Contiv](https://github.com/contiv/netplugin) provides configurable networking (native l3 using BGP, overlay using vxlan, classic l2, or Cisco-SDN/ACI) for various use cases. [Contiv](http://contiv.io) is all open sourced.
### Flannel
[Flannel](https://github.com/coreos/flannel#flannel) is a very simple overlay
network that satisfies the Kubernetes requirements. Many
people have reported success with Flannel and Kubernetes.
### Google Compute Engine (GCE)
For the Google Compute Engine cluster configuration scripts, we use [advanced
routing](https://cloud.google.com/compute/docs/networking#routing) to
assign each VM a subnet (default is `/24` - 254 IPs). Any traffic bound for that
subnet will be routed directly to the VM by the GCE network fabric. This is in
addition to the "main" IP address assigned to the VM, which is NAT'ed for
outbound internet access. A linux bridge (called `cbr0`) is configured to exist
on that subnet, and is passed to docker's `--bridge` flag.
We start Docker with:
```shell
DOCKER_OPTS="--bridge=cbr0 --iptables=false --ip-masq=false"
```
This bridge is created by Kubelet (controlled by the `--network-plugin=kubenet`
flag) according to the `Node`'s `spec.podCIDR`.
Docker will now allocate IPs from the `cbr-cidr` block. Containers can reach
each other and `Nodes` over the `cbr0` bridge. Those IPs are all routable
within the GCE project network.
GCE itself does not know anything about these IPs, though, so it will not NAT
them for outbound internet traffic. To achieve that we use an iptables rule to
masquerade (aka SNAT - to make it seem as if packets came from the `Node`
itself) traffic that is bound for IPs outside the GCE project network
(10.0.0.0/8).
```shell
iptables -t nat -A POSTROUTING ! -d 10.0.0.0/8 -o eth0 -j MASQUERADE
```
Lastly we enable IP forwarding in the kernel (so the kernel will process
packets for bridged containers):
```shell
sysctl net.ipv4.ip_forward=1
```
The result of all this is that all `Pods` can reach each other and can egress
traffic to the internet.
### L2 networks and linux bridging
If you have a "dumb" L2 network, such as a simple switch in a "bare-metal"
environment, you should be able to do something similar to the above GCE setup.
Note that these instructions have only been tried very casually - it seems to
work, but has not been thoroughly tested. If you use this technique and
perfect the process, please let us know.
Follow the "With Linux Bridge devices" section of [this very nice
tutorial](http://blog.oddbit.com/2014/08/11/four-ways-to-connect-a-docker/) from
Lars Kellogg-Stedman.
### Nuage Networks VCS (Virtualized Cloud Services)
[Nuage](http://www.nuagenetworks.net) provides a highly scalable policy-based Software-Defined Networking (SDN) platform. Nuage uses the open source Open vSwitch for the data plane along with a feature rich SDN Controller built on open standards.
The Nuage platform uses overlays to provide seamless policy-based networking between Kubernetes Pods and non-Kubernetes environments (VMs and bare metal servers). Nuage's policy abstraction model is designed with applications in mind and makes it easy to declare fine-grained policies for applications. The platform's real-time analytics engine enables visibility and security monitoring for Kubernetes applications.
### OpenVSwitch
[OpenVSwitch](/docs/admin/ovs-networking) is a somewhat more mature but also
complicated way to build an overlay network. This is endorsed by several of the
"Big Shops" for networking.
### OVN (Open Virtual Networking)
OVN is an open source network virtualization solution developed by the
Open vSwitch community. It lets one create logical switches, logical routers,
stateful ACLs, load-balancers, etc. to build different virtual networking
topologies. The project has a specific Kubernetes plugin and documentation
at [ovn-kubernetes](https://github.com/openvswitch/ovn-kubernetes).
### Project Calico
[Project Calico](http://docs.projectcalico.org/) is an open source container networking provider and network policy engine.
Calico provides a highly scalable networking and network policy solution for connecting Kubernetes pods based on the same IP networking principles as the internet. Calico can be deployed without encapsulation or overlays to provide high-performance, high-scale data center networking. Calico also provides fine-grained, intent based network security policy for Kubernetes pods via its distributed firewall.
Calico can also be run in policy enforcement mode in conjunction with other networking solutions such as Flannel, aka [canal](https://github.com/tigera/canal), or native GCE networking.
### Romana
[Romana](http://romana.io) is an open source network and security automation solution that lets you deploy Kubernetes without an overlay network. Romana supports Kubernetes [Network Policy](/docs/user-guide/networkpolicies/) to provide isolation across network namespaces.
### Weave Net from Weaveworks
[Weave Net](https://www.weave.works/products/weave-net/) is a
resilient and simple to use network for Kubernetes and its hosted applications.
Weave Net runs as a [CNI plug-in](https://www.weave.works/docs/net/latest/cni-plugin/)
or stand-alone. In either version, it doesn't require any configuration or extra code
to run, and in both cases, the network provides one IP address per pod - as is standard for Kubernetes.
## Other reading
The early design of the networking model and its rationale, and some future
plans are described in more detail in the [networking design
document](https://github.com/kubernetes/kubernetes/blob/{{page.githubbranch}}/docs/design/networking.md).

View File

@ -0,0 +1,29 @@
apiVersion: v1
kind: Service
metadata:
name: my-nginx-svc
labels:
app: nginx
spec:
type: LoadBalancer
ports:
- port: 80
selector:
app: nginx
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: my-nginx
spec:
replicas: 3
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx:1.7.9
ports:
- containerPort: 80

View File

@ -0,0 +1,368 @@
---
assignees:
- derekwaynecarr
- vishh
- timstclair
title: Configuring Out Of Resource Handling
---
* TOC
{:toc}
The `kubelet` needs to preserve node stability when available compute resources are low.
This is especially important when dealing with incompressible resources such as memory or disk.
If either resource is exhausted, the node would become unstable.
## Eviction Policy
The `kubelet` can pro-actively monitor for and prevent total starvation of a compute resource. In those cases, the `kubelet` can fail one or more pods in order to reclaim
the starved resource. When the `kubelet` fails a pod, it terminates all containers in the pod, and the `PodPhase`
is transitioned to `Failed`.
### Eviction Signals
The `kubelet` can support the ability to trigger eviction decisions on the signals described in the
table below. The value of each signal is described in the description column based on the `kubelet`
summary API.
| Eviction Signal | Description |
|----------------------------|-----------------------------------------------------------------------|
| `memory.available` | `memory.available` := `node.status.capacity[memory]` - `node.stats.memory.workingSet` |
| `nodefs.available` | `nodefs.available` := `node.stats.fs.available` |
| `nodefs.inodesFree` | `nodefs.inodesFree` := `node.stats.fs.inodesFree` |
| `imagefs.available` | `imagefs.available` := `node.stats.runtime.imagefs.available` |
| `imagefs.inodesFree` | `imagefs.inodesFree` := `node.stats.runtime.imagefs.inodesFree` |
Each of the above signals supports either a literal or percentage based value. The percentage based value
is calculated relative to the total capacity associated with each signal.
`kubelet` supports only two filesystem partitions.
1. The `nodefs` filesystem that kubelet uses for volumes, daemon logs, etc.
1. The `imagefs` filesystem that container runtimes use for storing images and container writable layers.
`imagefs` is optional. `kubelet` auto-discovers these filesystems using cAdvisor. `kubelet` does not care about any
other filesystems. Any other types of configurations are not currently supported by the kubelet. For example, it is
*not OK* to store volumes and logs in a dedicated `filesystem`.
In future releases, the `kubelet` will deprecate the existing [garbage collection](/docs/admin/garbage-collection/)
support in favor of eviction in response to disk pressure.
### Eviction Thresholds
The `kubelet` supports the ability to specify eviction thresholds that trigger the `kubelet` to reclaim resources.
Each threshold is of the following form:
`<eviction-signal><operator><quantity>`
* valid `eviction-signal` tokens as defined above.
* valid `operator` tokens are `<`
* valid `quantity` tokens must match the quantity representation used by Kubernetes
* an eviction threshold can be expressed as a percentage if it ends with the `%` token.
For example, if a node has `10Gi` of memory, and the desire is to induce eviction
if available memory falls below `1Gi`, an eviction threshold can be specified as either
of the following (but not both).
* `memory.available<10%`
* `memory.available<1Gi`
#### Soft Eviction Thresholds
A soft eviction threshold pairs an eviction threshold with a required
administrator specified grace period. No action is taken by the `kubelet`
to reclaim resources associated with the eviction signal until that grace
period has been exceeded. If no grace period is provided, the `kubelet` will
error on startup.
In addition, if a soft eviction threshold has been met, an operator can
specify a maximum allowed pod termination grace period to use when evicting
pods from the node. If specified, the `kubelet` will use the lesser value among
the `pod.Spec.TerminationGracePeriodSeconds` and the max allowed grace period.
If not specified, the `kubelet` will kill pods immediately with no graceful
termination.
To configure soft eviction thresholds, the following flags are supported:
* `eviction-soft` describes a set of eviction thresholds (e.g. `memory.available<1.5Gi`) that if met over a
corresponding grace period would trigger a pod eviction.
* `eviction-soft-grace-period` describes a set of eviction grace periods (e.g. `memory.available=1m30s`) that
correspond to how long a soft eviction threshold must hold before triggering a pod eviction.
* `eviction-max-pod-grace-period` describes the maximum allowed grace period (in seconds) to use when terminating
pods in response to a soft eviction threshold being met.
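For instance, the kubelet could be started with soft eviction settings like these (a sketch; the thresholds and grace periods are illustrative, not recommendations):
```shell
# Illustrative soft-eviction flags only; combine with your other kubelet flags.
# The values are quoted so the shell does not treat '<' as redirection.
kubelet \
  --eviction-soft="memory.available<1.5Gi,nodefs.available<10%" \
  --eviction-soft-grace-period="memory.available=1m30s,nodefs.available=2m" \
  --eviction-max-pod-grace-period=30
```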
#### Hard Eviction Thresholds
A hard eviction threshold has no grace period, and if observed, the `kubelet`
will take immediate action to reclaim the associated starved resource. If a
hard eviction threshold is met, the `kubelet` will kill the pod immediately
with no graceful termination.
To configure hard eviction thresholds, the following flag is supported:
* `eviction-hard` describes a set of eviction thresholds (e.g. `memory.available<1Gi`) that if met
would trigger a pod eviction.
The `kubelet` has the following default hard eviction thresholds:
* `--eviction-hard=memory.available<100Mi`
### Eviction Monitoring Interval
The `kubelet` evaluates eviction thresholds per its configured housekeeping interval.
* `housekeeping-interval` is the interval between container housekeepings.
### Node Conditions
The `kubelet` will map one or more eviction signals to a corresponding node condition.
If a hard eviction threshold has been met, or a soft eviction threshold has been met
independent of its associated grace period, the `kubelet` will report a condition that
reflects the node is under pressure.
The following node conditions are defined that correspond to the specified eviction signal.
| Node Condition | Eviction Signal | Description |
|-------------------------|-------------------------------|--------------------------------------------|
| `MemoryPressure` | `memory.available` | Available memory on the node has satisfied an eviction threshold |
| `DiskPressure` | `nodefs.available`, `nodefs.inodesFree`, `imagefs.available`, or `imagefs.inodesFree` | Available disk space and inodes on either the node's root filesystem or image filesystem have satisfied an eviction threshold |
The `kubelet` will continue to report node status updates at the frequency specified by
`--node-status-update-frequency` which defaults to `10s`.
### Oscillation of node conditions
If a node is oscillating above and below a soft eviction threshold, but not exceeding
its associated grace period, it would cause the corresponding node condition to
constantly oscillate between true and false, and could cause poor scheduling decisions
as a consequence.
To protect against this oscillation, the following flag is defined to control how
long the `kubelet` must wait before transitioning out of a pressure condition.
* `eviction-pressure-transition-period` is the duration for which the `kubelet` has
to wait before transitioning out of an eviction pressure condition.
The `kubelet` would ensure that it has not observed an eviction threshold being met
for the specified pressure condition for the period specified before toggling the
condition back to `false`.
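For example, the following setting (the value is illustrative) keeps a pressure condition reported for at least five minutes after the corresponding eviction signal was last observed:
```
--eviction-pressure-transition-period=5m0s
```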
### Reclaiming node level resources
If an eviction threshold has been met and the grace period has passed,
the `kubelet` will initiate the process of reclaiming the pressured resource
until it has observed the signal has gone below its defined threshold.
The `kubelet` attempts to reclaim node level resources prior to evicting end-user pods. If
disk pressure is observed, the `kubelet` reclaims node level resources differently if the
machine has a dedicated `imagefs` configured for the container runtime.
#### With Imagefs
If `nodefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order:
1. Delete dead pods/containers
If `imagefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order:
1. Delete all unused images
#### Without Imagefs
If `nodefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order:
1. Delete dead pods/containers
1. Delete all unused images
### Evicting end-user pods
If the `kubelet` is unable to reclaim sufficient resources on the node,
it will begin evicting pods.
The `kubelet` ranks pods for eviction as follows:
* by their quality of service
* by the consumption of the starved compute resource relative to the pod's scheduling request.
As a result, pod eviction occurs in the following order:
* `BestEffort` pods that consume the most of the starved resource are failed
first.
* `Burstable` pods that consume the greatest amount of the starved resource
relative to their request for that resource are killed first. If no pod
has exceeded its request, the strategy targets the largest consumer of the
starved resource.
* `Guaranteed` pods that consume the greatest amount of the starved resource
relative to their request are killed first. If no pod has exceeded its request,
the strategy targets the largest consumer of the starved resource.
A `Guaranteed` pod is guaranteed to never be evicted because of another pod's
resource consumption. If a system daemon (e.g. `kubelet`, `docker`, or `journald`)
is consuming more resources than were reserved via `system-reserved` or `kube-reserved` allocations,
and the node only has `Guaranteed` pod(s) remaining, then the node must choose to evict a
`Guaranteed` pod in order to preserve node stability, and to limit the impact
of the unexpected consumption to other `Guaranteed` pod(s).
Local disk is a `BestEffort` resource. If necessary, `kubelet` will evict pods one at a time to reclaim
disk when `DiskPressure` is encountered. The `kubelet` will rank pods by quality of service. If the `kubelet`
is responding to `inode` starvation, it will reclaim `inodes` by evicting pods with the lowest quality of service
first. If the `kubelet` is responding to a lack of available disk, it will rank pods within each quality of service
class by the amount of disk they consume and evict the largest consumers first.
#### With Imagefs
If `nodefs` is triggering evictions, `kubelet` will sort pods based on the usage on `nodefs`
- local volumes + logs of all its containers.
If `imagefs` is triggering evictions, `kubelet` will sort pods based on the writable layer usage of all its containers.
#### Without Imagefs
If `nodefs` is triggering evictions, `kubelet` will sort pods based on their total disk usage
- local volumes + logs & writable layer of all its containers.
### Minimum eviction reclaim
In certain scenarios, eviction of pods could result in reclamation of only a small amount of resources. This can lead to
the `kubelet` hitting eviction thresholds in repeated succession. In addition, reclaiming some resources, such as `disk`,
is time consuming.
To mitigate these issues, `kubelet` can have a per-resource `minimum-reclaim`. Whenever `kubelet` observes
resource pressure, `kubelet` will attempt to reclaim at least `minimum-reclaim` amount of resource below
the configured eviction threshold.
For example, with the following configuration:
```
--eviction-hard=memory.available<500Mi,nodefs.available<1Gi,imagefs.available<100Gi
--eviction-minimum-reclaim="memory.available=0Mi,nodefs.available=500Mi,imagefs.available=2Gi"
```
If an eviction threshold is triggered for `memory.available`, the `kubelet` will work to ensure
that `memory.available` is at least `500Mi`. For `nodefs.available`, the `kubelet` will work
to ensure that `nodefs.available` is at least `1.5Gi`, and for `imagefs.available` it will
work to ensure that `imagefs.available` is at least `102Gi` before no longer reporting pressure
on their associated resources.
The default `eviction-minimum-reclaim` is `0` for all resources.
### Scheduler
The node will report a condition when a compute resource is under pressure. The
scheduler views that condition as a signal to dissuade placing additional
pods on the node.
| Node Condition | Scheduler Behavior |
| ---------------- | ------------------------------------------------ |
| `MemoryPressure` | No new `BestEffort` pods are scheduled to the node. |
| `DiskPressure` | No new pods are scheduled to the node. |
## Node OOM Behavior
If the node experiences a system OOM (out of memory) event before the `kubelet` is able to reclaim memory,
the node depends on the [oom_killer](https://lwn.net/Articles/391222/) to respond.
The `kubelet` sets an `oom_score_adj` value for each container based on the quality of service for the pod.
| Quality of Service | oom_score_adj |
|----------------------------|-----------------------------------------------------------------------|
| `Guaranteed` | -998 |
| `BestEffort` | 1000 |
| `Burstable` | min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999) |
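For example (the numbers are illustrative), a `Burstable` pod that requests `4Gi` of memory on a node with `16Gi` of capacity is assigned:
```
oom_score_adj = min(max(2, 1000 - (1000 * 4Gi) / 16Gi), 999)
              = min(max(2, 750), 999)
              = 750
```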
If the `kubelet` is unable to reclaim memory prior to a node experiencing system OOM, the `oom_killer` will calculate
an `oom_score` based on the percentage of memory each container is using on the node, add the `oom_score_adj` to get an
effective `oom_score` for the container, and then kill the container with the highest score.
The intended behavior should be that containers with the lowest quality of service that
are consuming the largest amount of memory relative to the scheduling request should be killed first in order
to reclaim memory.
Unlike pod eviction, if a pod container is OOM killed, it may be restarted by the `kubelet` based on its `RestartPolicy`.
## Best Practices
### Schedulable resources and eviction policies
Let's imagine the following scenario:
* Node memory capacity: `10Gi`
* Operator wants to reserve 10% of memory capacity for system daemons (kernel, `kubelet`, etc.)
* Operator wants to evict pods at 95% memory utilization to reduce thrashing and incidence of system OOM.
To facilitate this scenario, the `kubelet` would be launched as follows:
```
--eviction-hard=memory.available<500Mi
--system-reserved=memory=1.5Gi
```
Implicit in this configuration is the understanding that "System reserved" should include the amount of memory
covered by the eviction threshold.
To reach that capacity, either some pod is using more than its request, or the system is using more than `1Gi`.
This configuration ensures that the scheduler does not place pods on a node that would immediately induce memory pressure
and trigger eviction, assuming those pods use less than their configured request.
### DaemonSet
It is never desired for a `kubelet` to evict a pod that was derived from
a `DaemonSet` since the pod will immediately be recreated and rescheduled
back to the same node.
At the moment, the `kubelet` has no ability to distinguish a pod created
from a `DaemonSet` versus any other object. If/when that information is
available, the `kubelet` could proactively filter those pods from the
candidate set of pods provided to the eviction strategy.
In general, it is strongly recommended that `DaemonSet` not
create `BestEffort` pods to avoid being identified as a candidate pod
for eviction. Instead `DaemonSet` should ideally launch `Guaranteed` pods.
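As a minimal sketch (the name, image, and resource values are illustrative assumptions), a `DaemonSet` whose containers set `limits` equal to `requests` produces `Guaranteed` pods:
```yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: node-agent
spec:
  template:
    metadata:
      labels:
        app: node-agent
    spec:
      containers:
      - name: agent
        image: busybox
        command: ["sh", "-c", "sleep 3600000"]
        resources:
          # Equal requests and limits yield the Guaranteed quality of service.
          requests:
            cpu: 100m
            memory: 200Mi
          limits:
            cpu: 100m
            memory: 200Mi
```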
## Deprecation of existing feature flags to reclaim disk
`kubelet` has been freeing up disk space on demand to keep the node stable.
As disk based eviction matures, the following `kubelet` flags will be marked for deprecation
in favor of the simpler configuration supported around eviction.
| Existing Flag | New Flag |
| ------------- | -------- |
| `--image-gc-high-threshold` | `--eviction-hard` or `--eviction-soft` |
| `--image-gc-low-threshold` | `--eviction-minimum-reclaim` |
| `--maximum-dead-containers` | deprecated |
| `--maximum-dead-containers-per-container` | deprecated |
| `--minimum-container-ttl-duration` | deprecated |
| `--low-diskspace-threshold-mb` | `--eviction-hard` or `--eviction-soft` |
| `--outofdisk-transition-frequency` | `--eviction-pressure-transition-period` |
## Known issues
### kubelet may not observe memory pressure right away
The `kubelet` currently polls `cAdvisor` to collect memory usage stats at a regular interval. If memory usage
increases rapidly within that window, the `kubelet` may not observe `MemoryPressure` fast enough, and the `OOMKiller`
will still be invoked. We intend to integrate with the `memcg` notification API in a future release to reduce this
latency, and instead have the kernel tell us immediately when a threshold has been crossed.
If you are not trying to achieve extreme utilization, but a sensible measure of overcommit, a viable workaround for
this issue is to set eviction thresholds at approximately 75% capacity. This increases the ability of this feature
to prevent system OOMs, and promote eviction of workloads so cluster state can rebalance.
### kubelet may evict more pods than needed
Pod eviction may evict more pods than needed due to a stats collection timing gap. This can be mitigated in the future by adding
the ability to get root container stats on an on-demand basis (https://github.com/google/cadvisor/issues/1247).
### How kubelet ranks pods for eviction in response to inode exhaustion
At this time, it is not possible to know how many inodes were consumed by a particular container. If the `kubelet` observes
inode exhaustion, it will evict pods by ranking them by quality of service. The following issue has been opened in cAdvisor
to track per-container inode consumption (https://github.com/google/cadvisor/issues/1422), which would allow us to rank pods
by inode consumption. For example, this would let us identify a container that created large numbers of 0 byte files, and evict
that pod over others.
View File
@ -0,0 +1,128 @@
---
assignees:
- jsafrane
title: Static Pods
---
**If you are running clustered Kubernetes and are using static pods to run a pod on every node, you should probably be using a [DaemonSet](/docs/admin/daemons/)!**
*Static pods* are managed directly by the kubelet daemon on a specific node, without the API server observing them. A static pod is not associated with any replication controller; the kubelet daemon itself watches it and restarts it when it crashes. There is no health check though. Static pods are always bound to one kubelet daemon and always run on the same node with it.
The kubelet automatically creates a so-called *mirror pod* on the Kubernetes API server for each static pod, so the pods are visible there, but they cannot be controlled from the API server.
## Static pod creation
A static pod can be created in two ways: either by using configuration file(s) or by HTTP.
### Configuration files
The configuration files are just standard pod definitions in JSON or YAML format in a specific directory. Use `kubelet --pod-manifest-path=<the directory>` to start the kubelet daemon, which periodically scans the directory and creates/deletes static pods as YAML/JSON files appear/disappear there.
For example, this is how to start a simple web server as a static pod:
1. Choose a node where we want to run the static pod. In this example, it's `my-node1`.
```
[joe@host ~] $ ssh my-node1
```
2. Choose a directory, say `/etc/kubelet.d` and place a web server pod definition there, e.g. `/etc/kubelet.d/static-web.yaml`:
```
[root@my-node1 ~] $ mkdir /etc/kubelet.d/
[root@my-node1 ~] $ cat <<EOF >/etc/kubelet.d/static-web.yaml
apiVersion: v1
kind: Pod
metadata:
name: static-web
labels:
role: myrole
spec:
containers:
- name: web
image: nginx
ports:
- name: web
containerPort: 80
protocol: TCP
EOF
```
3. Configure your kubelet daemon on the node to use this directory by running it with `--pod-manifest-path=/etc/kubelet.d/` argument.
On Fedora edit `/etc/kubernetes/kubelet` to include this line:
```
KUBELET_ARGS="--cluster-dns=10.254.0.10 --cluster-domain=kube.local --pod-manifest-path=/etc/kubelet.d/"
```
Instructions for other distributions or Kubernetes installations may vary.
4. Restart kubelet. On Fedora, this is:
```
[root@my-node1 ~] $ systemctl restart kubelet
```
## Pods created via HTTP
The kubelet periodically downloads a file specified by the `--manifest-url=<URL>` argument and interprets it as a JSON/YAML file with a pod definition. It works the same way as `--pod-manifest-path=<directory>`, i.e. the file is re-read periodically and changes are applied to running static pods (see below).
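For example, on Fedora the kubelet configuration could point at a manifest URL instead of a directory (the URL here is purely illustrative):
```
KUBELET_ARGS="--cluster-dns=10.254.0.10 --cluster-domain=kube.local --manifest-url=http://mycompany.example/static-web.yaml"
```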
## Behavior of static pods
When the kubelet starts, it automatically starts all pods defined in the directory specified by the `--pod-manifest-path=` argument or at the `--manifest-url=` URL, i.e. our static-web. (It may take some time to pull the nginx image, be patient…):
```shell
[joe@my-node1 ~] $ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
f6d05272b57e nginx:latest "nginx" 8 minutes ago Up 8 minutes k8s_web.6f802af4_static-web-fk-node1_default_67e24ed9466ba55986d120c867395f3c_378e5f3c
```
If we look at our Kubernetes API server (running on host `my-master`), we see that a new mirror-pod was created there too:
```shell
[joe@host ~] $ ssh my-master
[joe@my-master ~] $ kubectl get pods
NAME READY STATUS RESTARTS AGE
static-web-my-node1 1/1 Running 0 2m
```
Labels from the static pod are propagated into the mirror-pod and can be used as usual for filtering.
Notice that we cannot delete the pod via the API server (e.g. with the [`kubectl`](/docs/user-guide/kubectl/) command); the kubelet simply won't remove it.
```shell
[joe@my-master ~] $ kubectl delete pod static-web-my-node1
pods/static-web-my-node1
[joe@my-master ~] $ kubectl get pods
NAME READY STATUS RESTARTS AGE
static-web-my-node1 1/1 Running 0 12s
```
Back on our `my-node1` host, we can try to stop the container manually and see that the kubelet automatically restarts it after a while:
```shell
[joe@host ~] $ ssh my-node1
[joe@my-node1 ~] $ docker stop f6d05272b57e
[joe@my-node1 ~] $ sleep 20
[joe@my-node1 ~] $ docker ps
CONTAINER ID IMAGE COMMAND CREATED ...
5b920cbaf8b1 nginx:latest "nginx -g 'daemon of 2 seconds ago ...
```
## Dynamic addition and removal of static pods
The running kubelet periodically scans the configured directory (`/etc/kubelet.d` in our example) for changes and adds/removes pods as files appear/disappear in this directory.
```shell
[joe@my-node1 ~] $ mv /etc/kubelet.d/static-web.yaml /tmp
[joe@my-node1 ~] $ sleep 20
[joe@my-node1 ~] $ docker ps
// no nginx container is running
[joe@my-node1 ~] $ mv /tmp/static-web.yaml /etc/kubelet.d/
[joe@my-node1 ~] $ sleep 20
[joe@my-node1 ~] $ docker ps
CONTAINER ID IMAGE COMMAND CREATED ...
e7a62e3427f1 nginx:latest "nginx -g 'daemon of 27 seconds ago
```
View File
@ -0,0 +1,122 @@
---
assignees:
- sttts
title: Using Sysctls in a Kubernetes Cluster
---
* TOC
{:toc}
This document describes how sysctls are used within a Kubernetes cluster.
## What is a Sysctl?
In Linux, the sysctl interface allows an administrator to modify kernel
parameters at runtime. Parameters are available via the `/proc/sys/` virtual
process file system. The parameters cover various subsystems such as:
- kernel (common prefix: `kernel.`)
- networking (common prefix: `net.`)
- virtual memory (common prefix: `vm.`)
- MDADM (common prefix: `dev.`)
- More subsystems are described in [Kernel docs](https://www.kernel.org/doc/Documentation/sysctl/README).
To get a list of all parameters, you can run
```
$ sudo sysctl -a
```
## Namespaced vs. Node-Level Sysctls
A number of sysctls are _namespaced_ in today's Linux kernels. This means that
they can be set independently for each pod on a node. Being namespaced is a
requirement for sysctls to be accessible in a pod context within Kubernetes.
The following sysctls are known to be _namespaced_:
- `kernel.shm*`,
- `kernel.msg*`,
- `kernel.sem`,
- `fs.mqueue.*`,
- `net.*`.
Sysctls which are not namespaced are called _node-level_ and must be set
manually by the cluster admin, either by means of the underlying Linux
distribution of the nodes (e.g. via `/etc/sysctl.conf`) or using a DaemonSet
with privileged containers, as sketched below.
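As a minimal sketch (the name, image, and sysctl value are illustrative assumptions), such a DaemonSet could run a privileged container that applies the desired node-level sysctl on every node:
```yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: sysctl-setter
spec:
  template:
    metadata:
      labels:
        app: sysctl-setter
    spec:
      containers:
      - name: sysctl-setter
        image: busybox
        securityContext:
          privileged: true
        # Apply the (illustrative) node-level sysctl once, then keep the pod running.
        command: ["sh", "-c", "sysctl -w vm.max_map_count=262144 && sleep 3600000"]
```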
**Note**: it is good practice to consider nodes with special sysctl settings as
_tainted_ within a cluster, and to schedule onto them only pods that need those
sysctl settings. It is suggested to use the Kubernetes [_taints and toleration_
feature](/docs/user-guide/kubectl/kubectl_taint.md) to implement this.
## Safe vs. Unsafe Sysctls
Sysctls are grouped into _safe_ and _unsafe_ sysctls. In addition to proper
namespacing, a _safe_ sysctl must be properly _isolated_ between pods on the same
node. This means that setting a _safe_ sysctl for one pod
- must not have any influence on any other pod on the node
- must not allow harming the node's health
- must not allow gaining CPU or memory resources outside of the resource limits
of a pod.
By far, most of the _namespaced_ sysctls are not necessarily considered _safe_.
For Kubernetes 1.4, the following sysctls are supported in the _safe_ set:
- `kernel.shm_rmid_forced`,
- `net.ipv4.ip_local_port_range`,
- `net.ipv4.tcp_syncookies`.
This list will be extended in future Kubernetes versions when the kubelet
supports better isolation mechanisms.
All _safe_ sysctls are enabled by default.
All _unsafe_ sysctls are disabled by default and must be allowed manually by the
cluster admin on a per-node basis. Pods with disabled unsafe sysctls will be
scheduled, but will fail to launch.
**Warning**: Due to their nature of being _unsafe_, the use of _unsafe_ sysctls
is at-your-own-risk and can lead to severe problems like wrong behavior of
containers, resource shortage or complete breakage of a node.
## Enabling Unsafe Sysctls
With the warning above in mind, the cluster admin can allow certain _unsafe_
sysctls for very special situations such as high-performance or real-time
application tuning. _Unsafe_ sysctls are enabled on a node-by-node basis with a
flag of the kubelet, e.g.:
```shell
$ kubelet --experimental-allowed-unsafe-sysctls 'kernel.msg*,net.ipv4.route.min_pmtu' ...
```
Only _namespaced_ sysctls can be enabled this way.
## Setting Sysctls for a Pod
The sysctl feature is an alpha API in Kubernetes 1.4. Therefore, sysctls are set
using annotations on pods. They apply to all containers in the same pod.
Here is an example, with different annotations for _safe_ and _unsafe_ sysctls:
```yaml
apiVersion: v1
kind: Pod
metadata:
name: sysctl-example
annotations:
security.alpha.kubernetes.io/sysctls: kernel.shm_rmid_forced=1
security.alpha.kubernetes.io/unsafe-sysctls: net.ipv4.route.min_pmtu=1000,kernel.msgmax=1 2 3
spec:
...
```
**Note**: a pod with the _unsafe_ sysctls specified above will fail to launch on
any node which has not explicitly enabled those two _unsafe_ sysctls. As with
_node-level_ sysctls, it is recommended to use the [_taints and toleration_
feature](/docs/user-guide/kubectl/kubectl_taint.md) or [labels on nodes](/docs/user-guide/labels.md)
to schedule those pods onto the right nodes.
View File
@ -0,0 +1,119 @@
---
assignees:
- mikedanese
title: Configuration Best Practices
---
This document is meant to highlight and consolidate in one place configuration best practices that are introduced throughout the user-guide and getting-started documentation and examples. This is a living document so if you think of something that is not on this list but might be useful to others, please don't hesitate to file an issue or submit a PR.
## General Config Tips
- When defining configurations, specify the latest stable API version (currently v1).
- Configuration files should be stored in version control before being pushed to the cluster. This allows a configuration to be quickly rolled back if needed, and will aid with cluster re-creation and restoration if necessary.
- Write your configuration files using YAML rather than JSON. They can be used interchangeably in almost all scenarios, but YAML tends to be more user-friendly for config.
- Group related objects together in a single file where this makes sense. This format is often easier to manage than separate files. See the [guestbook-all-in-one.yaml](https://github.com/kubernetes/kubernetes/tree/{{page.githubbranch}}/examples/guestbook/all-in-one/guestbook-all-in-one.yaml) file as an example of this syntax.
(Note also that many `kubectl` commands can be called on a directory, and so you can also call
`kubectl create` on a directory of config files— see below for more detail).
- Don't specify default values unnecessarily, in order to simplify and minimize configs, and to
reduce error. For example, omit the selector and labels in a `ReplicationController` if you want
them to be the same as the labels in its `podTemplate`, since those fields are populated from the
`podTemplate` labels by default. See the [guestbook app's](https://github.com/kubernetes/kubernetes/tree/{{page.githubbranch}}/examples/guestbook/) .yaml files for some [examples](https://github.com/kubernetes/kubernetes/tree/{{page.githubbranch}}/examples/guestbook/frontend-deployment.yaml) of this.
- Put an object description in an annotation to allow better introspection.
## "Naked" Pods vs Replication Controllers and Jobs
- If there is a viable alternative to naked pods (i.e., pods not bound to a [replication controller
](/docs/user-guide/replication-controller)), go with the alternative. Naked pods will not be rescheduled in the
event of node failure.
Replication controllers are almost always preferable to creating pods, except for some explicit
[`restartPolicy: Never`](/docs/user-guide/pod-states/#restartpolicy) scenarios. A
[Job](/docs/concepts/jobs/run-to-completion-finite-workloads/) object may also be appropriate.
## Services
- It's typically best to create a [service](/docs/user-guide/services/) before corresponding [replication
controllers](/docs/user-guide/replication-controller/), so that the scheduler can spread the pods comprising the
service. You can also create a replication controller without specifying replicas (this will set
replicas=1), create a service, then scale up the replication controller. This can be useful in
ensuring that one replica works before creating lots of them.
- Don't use `hostPort` (which specifies the port number to expose on the host) unless absolutely
necessary, e.g., for a node daemon. When you bind a Pod to a `hostPort`, there are a limited
number of places that pod can be scheduled, due to port conflicts— you can only schedule as many
such Pods as there are nodes in your Kubernetes cluster.
If you only need access to the port for debugging purposes, you can use the [kubectl proxy and apiserver proxy](/docs/user-guide/connecting-to-applications-proxy/) or [kubectl port-forward](/docs/user-guide/connecting-to-applications-port-forward/).
You can use a [Service](/docs/user-guide/services/) object for external service access.
If you do need to expose a pod's port on the host machine, consider using a [NodePort](/docs/user-guide/services/#type-nodeport) service before resorting to `hostPort`.
- Avoid using `hostNetwork`, for the same reasons as `hostPort`.
- Use _headless services_ for easy service discovery when you don't need kube-proxy load balancing.
See [headless services](/docs/user-guide/services/#headless-services).
## Using Labels
- Define and use [labels](/docs/user-guide/labels/) that identify __semantic attributes__ of your application or
deployment. For example, instead of attaching a label to a set of pods to explicitly represent
some service (e.g., `service: myservice`), or explicitly representing the replication
controller managing the pods (e.g., `controller: mycontroller`), attach labels that identify
semantic attributes, such as `{ app: myapp, tier: frontend, phase: test, deployment: v3 }`. This
will let you select the object groups appropriate to the context— e.g., a service for all "tier:
frontend" pods, or all "test" phase components of app "myapp". See the
[guestbook](https://github.com/kubernetes/kubernetes/tree/{{page.githubbranch}}/examples/guestbook/) app for an example of this approach.
A service can be made to span multiple deployments, such as is done across [rolling updates](/docs/user-guide/kubectl/kubectl_rolling-update/), by simply omitting release-specific labels from its selector, rather than updating a service's selector to match the replication controller's selector fully.
- To facilitate rolling updates, include version info in replication controller names, e.g. as a
suffix to the name. It is useful to set a 'version' label as well. The rolling update creates a
new controller as opposed to modifying the existing controller. So, there will be issues with
version-agnostic controller names. See the [documentation](/docs/user-guide/kubectl/kubectl_rolling-update/) on
the rolling-update command for more detail.
Note that the [Deployment](/docs/user-guide/deployments/) object obviates the need to manage replication
controller 'version names'. A desired state of an object is described by a Deployment, and if
changes to that spec are _applied_, the deployment controller changes the actual state to the
desired state at a controlled rate. (Deployment objects are currently part of the [`extensions`
API Group](/docs/api/#api-groups).)
- You can manipulate labels for debugging. Because Kubernetes replication controllers and services
match to pods using labels, this allows you to remove a pod from being considered by a
controller, or served traffic by a service, by removing the relevant selector labels. If you
remove the labels of an existing pod, its controller will create a new pod to take its place.
This is a useful way to debug a previously "live" pod in a quarantine environment. See the
[`kubectl label`](/docs/user-guide/kubectl/kubectl_label/) command.
## Container Images
- The [default container image pull policy](/docs/user-guide/images/) is `IfNotPresent`, which causes the
[Kubelet](/docs/admin/kubelet/) to not pull an image if it already exists. If you would like to
always force a pull, you must specify a pull image policy of `Always` in your .yaml file
(`imagePullPolicy: Always`) or specify a `:latest` tag on your image.
That is, if you're specifying an image with a tag other than `:latest`, e.g. `myimage:v1`, and
there is an image update to that same tag, the Kubelet won't pull the updated image. You can
address this by ensuring that any updates to an image bump the image tag as well (e.g.
`myimage:v2`), and ensuring that your configs point to the correct version.
**Note:** you should avoid using the `:latest` tag when deploying containers in production, because this makes it hard
to track which version of the image is running and hard to roll back.
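For example (the image name and tag are illustrative), a pod that pins a versioned tag while still forcing a fresh pull on each start might look like:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: my-app
    image: myregistry.example/myimage:v2
    # Pull the pinned tag on every container start instead of relying on a cached copy.
    imagePullPolicy: Always
```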
## Using kubectl
- Use `kubectl create -f <directory>` where possible. This looks for config objects in all `.yaml`, `.yml`, and `.json` files in `<directory>` and passes them to `create`.
- Use `kubectl delete` rather than `stop`. `Delete` has a superset of the functionality of `stop`, and `stop` is deprecated.
- Use kubectl bulk operations (via files and/or labels) for get and delete. See [label selectors](/docs/user-guide/labels/#label-selectors) and [using labels effectively](/docs/concepts/cluster-administration/manage-deployment/#using-labels-effectively).
- Use `kubectl run` and `expose` to quickly create and expose single container Deployments. See the [quick start guide](/docs/user-guide/quick-start/) for an example.
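For example (the name, image, and port are illustrative), a single-container Deployment can be created and exposed with:
```shell
$ kubectl run my-nginx --image=nginx --port=80
$ kubectl expose deployment my-nginx --port=80 --type=NodePort
```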
View File
@ -0,0 +1,15 @@
apiVersion: batch/v1
kind: Job
metadata:
name: pi
spec:
template:
metadata:
name: pi
spec:
containers:
- name: pi
image: perl
command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
restartPolicy: Never
View File
@ -0,0 +1,385 @@
---
assignees:
- erictune
- soltysh
title: Run to Completion Finite Workloads
---
* TOC
{:toc}
## What is a Job?
A _job_ creates one or more pods and ensures that a specified number of them successfully terminate.
As pods successfully complete, the _job_ tracks the successful completions. When a specified number
of successful completions is reached, the job itself is complete. Deleting a Job will clean up the
pods it created.
A simple case is to create one Job object in order to reliably run one Pod to completion.
The Job object will start a new Pod if the first pod fails or is deleted (for example
due to a node hardware failure or a node reboot).
A Job can also be used to run multiple pods in parallel.
### extensions/v1beta1.Job is deprecated
Starting from version 1.5 `extensions/v1beta1.Job` is being deprecated, with a plan to be removed in
version 1.6 of Kubernetes (see this [issue](https://github.com/kubernetes/kubernetes/issues/32763)).
Please use `batch/v1.Job` instead.
## Running an example Job
Here is an example Job config. It computes π to 2000 places and prints it out.
It takes around 10s to complete.
{% include code.html language="yaml" file="job.yaml" ghlink="/docs/user-guide/job.yaml" %}
Run the example job by downloading the example file and then running this command:
```shell
$ kubectl create -f ./job.yaml
job "pi" created
```
Check on the status of the job using this command:
```shell
$ kubectl describe jobs/pi
Name: pi
Namespace: default
Image(s): perl
Selector: controller-uid=b1db589a-2c8d-11e6-b324-0209dc45a495
Parallelism: 1
Completions: 1
Start Time: Tue, 07 Jun 2016 10:56:16 +0200
Labels: controller-uid=b1db589a-2c8d-11e6-b324-0209dc45a495,job-name=pi
Pods Statuses: 0 Running / 1 Succeeded / 0 Failed
No volumes.
Events:
FirstSeen LastSeen Count From SubobjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
1m 1m 1 {job-controller } Normal SuccessfulCreate Created pod: pi-dtn4q
```
To view completed pods of a job, use `kubectl get pods --show-all`. The `--show-all` flag shows completed pods too.
To list all the pods that belong to a job in a machine readable form, you can use a command like this:
```shell
$ pods=$(kubectl get pods --show-all --selector=job-name=pi --output=jsonpath={.items..metadata.name})
echo $pods
pi-aiw0a
```
Here, the selector is the same as the selector for the job. The `--output=jsonpath` option specifies an expression
that just gets the name from each pod in the returned list.
View the standard output of one of the pods:
```shell
$ kubectl logs $pods
3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679821480865132823066470938446095505822317253594081284811174502841027019385211055596446229489549303819644288109756659334461284756482337867831652712019091456485669234603486104543266482133936072602491412737245870066063155881748815209209628292540917153643678925903600113305305488204665213841469519415116094330572703657595919530921861173819326117931051185480744623799627495673518857527248912279381830119491298336733624406566430860213949463952247371907021798609437027705392171762931767523846748184676694051320005681271452635608277857713427577896091736371787214684409012249534301465495853710507922796892589235420199561121290219608640344181598136297747713099605187072113499999983729780499510597317328160963185950244594553469083026425223082533446850352619311881710100031378387528865875332083814206171776691473035982534904287554687311595628638823537875937519577818577805321712268066130019278766111959092164201989380952572010654858632788659361533818279682303019520353018529689957736225994138912497217752834791315155748572424541506959508295331168617278558890750983817546374649393192550604009277016711390098488240128583616035637076601047101819429555961989467678374494482553797747268471040475346462080466842590694912933136770289891521047521620569660240580381501935112533824300355876402474964732639141992726042699227967823547816360093417216412199245863150302861829745557067498385054945885869269956909272107975093029553211653449872027559602364806654991198818347977535663698074265425278625518184175746728909777727938000816470600161452491921732172147723501414419735685481613611573525521334757418494684385233239073941433345477624168625189835694855620992192221842725502542568876717904946016534668049886272327917860857843838279679766814541009538837863609506800642251252051173929848960841284886269456042419652850222106611863067442786220391949450471237137869609563643719172874677646575739624138908658326459958133904780275901
```
## Writing a Job Spec
As with all other Kubernetes config, a Job needs `apiVersion`, `kind`, and `metadata` fields. For
general information about working with config files, see [here](/docs/user-guide/simple-yaml),
[here](/docs/user-guide/configuring-containers), and [here](/docs/user-guide/working-with-resources).
A Job also needs a [`.spec` section](https://github.com/kubernetes/kubernetes/tree/{{page.githubbranch}}/docs/devel/api-conventions.md#spec-and-status).
### Pod Template
The `.spec.template` is the only required field of the `.spec`.
The `.spec.template` is a [pod template](/docs/user-guide/replication-controller/#pod-template). It has exactly
the same schema as a [pod](/docs/user-guide/pods), except it is nested and does not have an `apiVersion` or
`kind`.
In addition to required fields for a Pod, a pod template in a job must specify appropriate
labels (see [pod selector](#pod-selector)) and an appropriate restart policy.
Only a [`RestartPolicy`](/docs/user-guide/pod-states/#restartpolicy) equal to `Never` or `OnFailure` is allowed.
### Pod Selector
The `.spec.selector` field is optional. In almost all cases you should not specify it.
See section [specifying your own pod selector](#specifying-your-own-pod-selector).
### Parallel Jobs
There are three main types of jobs:
1. Non-parallel Jobs
- normally only one pod is started, unless the pod fails.
- job is complete as soon as Pod terminates successfully.
1. Parallel Jobs with a *fixed completion count*:
- specify a non-zero positive value for `.spec.completions`
- the job is complete when there is one successful pod for each value in the range 1 to `.spec.completions`.
- **not implemented yet:** each pod passed a different index in the range 1 to `.spec.completions`.
1. Parallel Jobs with a *work queue*:
- do not specify `.spec.completions`; it defaults to `.spec.parallelism`
- the pods must coordinate with themselves or an external service to determine what each should work on
- each pod is independently capable of determining whether or not all its peers are done, thus the entire Job is done.
- when _any_ pod terminates with success, no new pods are created.
- once at least one pod has terminated with success and all pods are terminated, then the job is completed with success.
- once any pod has exited with success, no other pod should still be doing any work or writing any output. They should all be
in the process of exiting.
For a Non-parallel job, you can leave both `.spec.completions` and `.spec.parallelism` unset. When both are
unset, both are defaulted to 1.
For a Fixed Completion Count job, you should set `.spec.completions` to the number of completions needed.
You can set `.spec.parallelism`, or leave it unset and it will default to 1.
For a Work Queue Job, you must leave `.spec.completions` unset, and set `.spec.parallelism` to
a non-negative integer.
For more information about how to make use of the different types of job, see the [job patterns](#job-patterns) section.
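As a minimal sketch (the name, image, and counts are illustrative), a Job with a *fixed completion count* that processes eight work items, running at most two pods at a time, could look like:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: process-items
spec:
  completions: 8     # the Job is complete after 8 pods succeed
  parallelism: 2     # run at most 2 pods at any time
  template:
    metadata:
      name: process-items
    spec:
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "echo processing one work item && sleep 5"]
      restartPolicy: Never
```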
#### Controlling Parallelism
The requested parallelism (`.spec.parallelism`) can be set to any non-negative value.
If it is unspecified, it defaults to 1.
If it is specified as 0, then the Job is effectively paused until it is increased.
A job can be scaled up using the `kubectl scale` command. For example, the following
command sets `.spec.parallelism` of a job called `myjob` to 10:
```shell
$ kubectl scale --replicas=10 jobs/myjob
job "myjob" scaled
```
You can also use the `scale` subresource of the Job resource.
Actual parallelism (number of pods running at any instant) may be more or less than requested
parallelism, for a variety of reasons:
- For Fixed Completion Count jobs, the actual number of pods running in parallel will not exceed the number of
remaining completions. Higher values of `.spec.parallelism` are effectively ignored.
- For work queue jobs, no new pods are started after any pod has succeeded -- remaining pods are allowed to complete, however.
- If the controller has not had time to react.
- If the controller failed to create pods for any reason (lack of ResourceQuota, lack of permission, etc.),
then there may be fewer pods than requested.
- The controller may throttle new pod creation due to excessive previous pod failures in the same Job.
- When a pod is gracefully shutdown, it takes time to stop.
## Handling Pod and Container Failures
A Container in a Pod may fail for a number of reasons, such as because the process in it exited with
a non-zero exit code, or the Container was killed for exceeding a memory limit, etc. If this
happens, and the `.spec.template.spec.restartPolicy = "OnFailure"`, then the Pod stays
on the node, but the Container is re-run. Therefore, your program needs to handle the case when it is
restarted locally, or else specify `.spec.template.spec.restartPolicy = "Never"`.
See [pods-states](/docs/user-guide/pod-states) for more information on `restartPolicy`.
An entire Pod can also fail, for a number of reasons, such as when the pod is kicked off the node
(node is upgraded, rebooted, deleted, etc.), or if a container of the Pod fails and the
`.spec.template.spec.restartPolicy = "Never"`. When a Pod fails, then the Job controller
starts a new Pod. Therefore, your program needs to handle the case when it is restarted in a new
pod. In particular, it needs to handle temporary files, locks, incomplete output and the like
caused by previous runs.
Note that even if you specify `.spec.parallelism = 1` and `.spec.completions = 1` and
`.spec.template.spec.restartPolicy = "Never"`, the same program may
sometimes be started twice.
If you do specify `.spec.parallelism` and `.spec.completions` both greater than 1, then there may be
multiple pods running at once. Therefore, your pods must also be tolerant of concurrency.
## Job Termination and Cleanup
When a Job completes, no more Pods are created, but the Pods are not deleted either. Since they are terminated,
they don't show up with `kubectl get pods`, but they will show up with `kubectl get pods -a`. Keeping them around
allows you to still view the logs of completed pods to check for errors, warnings, or other diagnostic output.
The job object also remains after it is completed so that you can view its status. It is up to the user to delete
old jobs after noting their status. Delete the job with `kubectl` (e.g. `kubectl delete jobs/pi` or `kubectl delete -f ./job.yaml`). When you delete the job using `kubectl`, all the pods it created are deleted too.
If a Job's pods are failing repeatedly, the Job will keep creating new pods forever, by default.
Retrying forever can be a useful pattern. If an external dependency of the Job's
pods is missing (for example an input file on a networked storage volume is not present), then the
Job will keep trying Pods, and when you later resolve the external dependency (for example, creating
the missing file) the Job will then complete without any further action.
However, if you prefer not to retry forever, you can set a deadline on the job. Do this by setting the
`spec.activeDeadlineSeconds` field of the job to a number of seconds. The job will have status with
`reason: DeadlineExceeded`. No more pods will be created, and existing pods will be deleted.
```yaml
apiVersion: batch/v1
kind: Job
metadata:
name: pi-with-timeout
spec:
activeDeadlineSeconds: 100
template:
metadata:
name: pi
spec:
containers:
- name: pi
image: perl
command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
restartPolicy: Never
```
Note that both the Job Spec and the Pod Template Spec within the Job have a field with the same name.
Set the one on the Job.
## Job Patterns
The Job object can be used to support reliable parallel execution of Pods. The Job object is not
designed to support closely-communicating parallel processes, as commonly found in scientific
computing. It does support parallel processing of a set of independent but related *work items*.
These might be emails to be sent, frames to be rendered, files to be transcoded, ranges of keys in a
NoSQL database to scan, and so on.
In a complex system, there may be multiple different sets of work items. Here we are just
considering one set of work items that the user wants to manage together &mdash; a *batch job*.
There are several different patterns for parallel computation, each with strengths and weaknesses.
The tradeoffs are:
- One Job object for each work item, vs. a single Job object for all work items. The latter is
better for large numbers of work items. The former creates some overhead for the user and for the
system to manage large numbers of Job objects. Also, with the latter, the resource usage of the job
(number of concurrently running pods) can be easily adjusted using the `kubectl scale` command.
- Number of pods created equals number of work items, vs. each pod can process multiple work items.
The former typically requires less modification to existing code and containers. The latter
is better for large numbers of work items, for similar reasons to the previous bullet.
- Several approaches use a work queue. This requires running a queue service,
and modifications to the existing program or container to make it use the work queue.
Other approaches are easier to adapt to an existing containerised application.
The tradeoffs are summarized here, with columns 2 to 4 corresponding to the above tradeoffs.
The pattern names are also links to examples and more detailed description.
| Pattern | Single Job object | Fewer pods than work items? | Use app unmodified? | Works in Kube 1.1? |
| -------------------------------------------------------------------- |:-----------------:|:---------------------------:|:-------------------:|:-------------------:|
| [Job Template Expansion](/docs/user-guide/jobs/expansions) | | | ✓ | ✓ |
| [Queue with Pod Per Work Item](/docs/tasks/job/work-queue-1/) | ✓ | | sometimes | ✓ |
| [Queue with Variable Pod Count](/docs/tasks/job/fine-parallel-processing-work-queue/) | ✓ | ✓ | | ✓ |
| Single Job with Static Work Assignment | ✓ | | ✓ | |
When you specify completions with `.spec.completions`, each Pod created by the Job controller
has an identical [`spec`](https://github.com/kubernetes/kubernetes/tree/{{page.githubbranch}}/docs/devel/api-conventions.md#spec-and-status). This means that
all pods will have the same command line and the same
image, the same volumes, and (almost) the same environment variables. These patterns
are different ways to arrange for pods to work on different things.
This table shows the required settings for `.spec.parallelism` and `.spec.completions` for each of the patterns.
Here, `W` is the number of work items.
| Pattern | `.spec.completions` | `.spec.parallelism` |
| -------------------------------------------------------------------- |:-------------------:|:--------------------:|
| [Job Template Expansion](/docs/tasks/job/parallel-processing-expansion/) | 1 | should be 1 |
| [Queue with Pod Per Work Item](/docs/tasks/job/work-queue-1/) | W | any |
| [Queue with Variable Pod Count](/docs/tasks/job/fine-parallel-processing-work-queue/) | 1 | any |
| Single Job with Static Work Assignment | W | any |
## Advanced Usage
### Specifying your own pod selector
Normally, when you create a job object, you do not specify `spec.selector`.
The system defaulting logic adds this field when the job is created.
It picks a selector value that will not overlap with any other jobs.
However, in some cases, you might need to override this automatically set selector.
To do this, you can specify the `spec.selector` of the job.
Be very careful when doing this. If you specify a label selector which is not
unique to the pods of that job, and which matches unrelated pods, then pods of the unrelated
job may be deleted, or this job may count other pods as completing it, or one or both
of the jobs may refuse to create pods or run to completion. If a non-unique selector is
chosen, then other controllers (e.g. ReplicationController) and their pods may behave
in unpredictable ways too. Kubernetes will not stop you from making a mistake when
specifying `spec.selector`.
Here is an example of a case when you might want to use this feature.
Say job `old` is already running. You want existing pods
to keep running, but you want the rest of the pods it creates
to use a different pod template and for the job to have a new name.
You cannot update the job because these fields are not updatable.
Therefore, you delete job `old` but leave its pods
running, using `kubectl delete jobs/old --cascade=false`.
Before deleting it, you make a note of what selector it uses:
```
kind: Job
metadata:
name: old
...
spec:
selector:
matchLabels:
job-uid: a8f3d00d-c6d2-11e5-9f87-42010af00002
...
```
Then you create a new job with name `new` and you explicitly specify the same selector.
Since the existing pods have label `job-uid=a8f3d00d-c6d2-11e5-9f87-42010af00002`,
they are controlled by job `new` as well.
You need to specify `manualSelector: true` in the new job since you are not using
the selector that the system normally generates for you automatically.
```
kind: Job
metadata:
name: new
...
spec:
manualSelector: true
selector:
matchLabels:
job-uid: a8f3d00d-c6d2-11e5-9f87-42010af00002
...
```
The new Job itself will have a different uid from `a8f3d00d-c6d2-11e5-9f87-42010af00002`. Setting
`manualSelector: true` tells the system that you know what you are doing and to allow this
mismatch.
## Alternatives
### Bare Pods
When the node that a pod is running on reboots or fails, the pod is terminated
and will not be restarted. However, a Job will create new pods to replace terminated ones.
For this reason, we recommend that you use a job rather than a bare pod, even if your application
requires only a single pod.
### Replication Controller
Jobs are complementary to [Replication Controllers](/docs/user-guide/replication-controller).
A Replication Controller manages pods which are not expected to terminate (e.g. web servers), and a Job
manages pods that are expected to terminate (e.g. batch jobs).
As discussed in [life of a pod](/docs/user-guide/pod-states), `Job` is *only* appropriate for pods with
`RestartPolicy` equal to `OnFailure` or `Never`. (Note: If `RestartPolicy` is not set, the default
value is `Always`.)
### Single Job starts Controller Pod
Another pattern is for a single Job to create a pod which then creates other pods, acting as a sort
of custom controller for those pods. This allows the most flexibility, but may be somewhat
complicated to get started with and offers less integration with Kubernetes.
One example of this pattern would be a Job which starts a Pod which runs a script that in turn
starts a Spark master controller (see [spark example](https://github.com/kubernetes/kubernetes/tree/{{page.githubbranch}}/examples/spark/README.md)), runs a spark
driver, and then cleans up.
An advantage of this approach is that the overall process gets the completion guarantee of a Job
object, but complete control over what pods are created and how work is assigned to them.
## Cron Jobs
Support for creating Jobs at specified times/dates (i.e. cron) is available in Kubernetes [1.4](https://github.com/kubernetes/kubernetes/pull/11980). More information is available in the [cron job documentation](http://kubernetes.io/docs/user-guide/cron-jobs/).
View File
@ -0,0 +1,136 @@
---
assignees:
- lavalamp
title: Kubernetes Components
---
This document outlines the various binary components that need to run to
deliver a functioning Kubernetes cluster.
## Master Components
Master components are those that provide the cluster's control plane. For
example, master components are responsible for making global decisions about the
cluster (e.g., scheduling), and detecting and responding to cluster events
(e.g., starting up a new pod when a replication controller's 'replicas' field is
unsatisfied).
In theory, Master components can be run on any node in the cluster. However,
for simplicity, current setup scripts typically start all master components on
the same VM and do not run user containers on this VM. See
[high-availability.md](/docs/admin/high-availability) for an example multi-master-VM setup.
Even in the future, when Kubernetes is fully self-hosting, it will probably be
wise to only allow master components to schedule on a subset of nodes, to limit
co-running with user-run pods, reducing the possible scope of a
node-compromising security exploit.
### kube-apiserver
[kube-apiserver](/docs/admin/kube-apiserver) exposes the Kubernetes API; it is the front-end for the
Kubernetes control plane. It is designed to scale horizontally (i.e., it scales
by running more instances; see [high-availability.md](/docs/admin/high-availability)).
### etcd
[etcd](/docs/admin/etcd) is used as Kubernetes' backing store. All cluster data is stored here.
Proper administration of a Kubernetes cluster includes a backup plan for etcd's
data.
### kube-controller-manager
[kube-controller-manager](/docs/admin/kube-controller-manager) is a binary that runs controllers, which are the
background threads that handle routine tasks in the cluster. Logically, each
controller is a separate process, but to reduce the number of moving pieces in
the system, they are all compiled into a single binary and run in a single
process.
These controllers include:
* Node Controller: Responsible for noticing & responding when nodes go down.
* Replication Controller: Responsible for maintaining the correct number of pods for every replication
controller object in the system.
* Endpoints Controller: Populates the Endpoints object (i.e., join Services & Pods).
* Service Account & Token Controllers: Create default accounts and API access tokens for new namespaces.
* ... and others.
### kube-scheduler
[kube-scheduler](/docs/admin/kube-scheduler) watches newly created pods that have no node assigned, and
selects a node for them to run on.
### addons
Addons are pods and services that implement cluster features. The pods may be managed
by Deployments, ReplicationControllers, etc. Namespaced addon objects are created in
the "kube-system" namespace.
The addon manager is responsible for creating and maintaining addon resources.
See [here](http://releases.k8s.io/HEAD/cluster/addons) for more details.
#### DNS
While the other addons are not strictly required, all Kubernetes
clusters should have [cluster DNS](/docs/admin/dns/), as many examples rely on it.
Cluster DNS is a DNS server, in addition to the other DNS server(s) in your
environment, which serves DNS records for Kubernetes services.
Containers started by Kubernetes automatically include this DNS server
in their DNS searches.
#### User interface
The kube-ui provides a read-only overview of the cluster state. Access
[the UI using kubectl proxy](/docs/user-guide/connecting-to-applications-proxy/#connecting-to-the-kube-ui-service-from-your-local-workstation)
#### Container Resource Monitoring
[Container Resource Monitoring](/docs/user-guide/monitoring) records generic time-series metrics
about containers in a central database, and provides a UI for browsing that data.
#### Cluster-level Logging
A [Cluster-level logging](/docs/user-guide/logging/overview) mechanism is responsible for
saving container logs to a central log store with search/browsing interface.
## Node components
Node components run on every node, maintaining running pods and providing them
the Kubernetes runtime environment.
### kubelet
[kubelet](/docs/admin/kubelet) is the primary node agent. It:
* Watches for pods that have been assigned to its node (either by apiserver
or via local configuration file) and:
* Mounts the pod's required volumes
* Downloads the pod's secrets
* Runs the pod's containers via docker (or, experimentally, rkt).
* Periodically executes any requested container liveness probes.
* Reports the status of the pod back to the rest of the system, by creating a
"mirror pod" if necessary.
* Reports the status of the node back to the rest of the system.
### kube-proxy
[kube-proxy](/docs/admin/kube-proxy) enables the Kubernetes service abstraction by maintaining
network rules on the host and performing connection forwarding.
### docker
`docker` is of course used for actually running containers.
### rkt
`rkt` is supported experimentally as an alternative to docker.
### supervisord
`supervisord` is a lightweight process babysitting system for keeping kubelet and docker
running.
### fluentd
`fluentd` is a daemon which helps provide [cluster-level logging](#cluster-level-logging).
View File
@ -0,0 +1,109 @@
---
assignees:
- bgrant0607
- erictune
- lavalamp
title: The Kubernetes API
---
Primary system and API concepts are documented in the [User guide](/docs/user-guide/).
Overall API conventions are described in the [API conventions doc](https://github.com/kubernetes/kubernetes/tree/{{page.githubbranch}}/docs/devel/api-conventions.md).
Remote access to the API is discussed in the [access doc](/docs/admin/accessing-the-api).
The Kubernetes API also serves as the foundation for the declarative configuration schema for the system. The [Kubectl](/docs/user-guide/kubectl) command-line tool can be used to create, update, delete, and get API objects.
Kubernetes also stores its serialized state (currently in [etcd](https://coreos.com/docs/distributed-configuration/getting-started-with-etcd/)) in terms of the API resources.
Kubernetes itself is decomposed into multiple components, which interact through its API.
## API changes
In our experience, any system that is successful needs to grow and change as new use cases emerge or existing ones change. Therefore, we expect the Kubernetes API to continuously change and grow. However, we intend to not break compatibility with existing clients, for an extended period of time. In general, new API resources and new resource fields can be expected to be added frequently. Elimination of resources or fields will require following a deprecation process. The precise deprecation policy for eliminating features is TBD, but once we reach our 1.0 milestone, there will be a specific policy.
What constitutes a compatible change and how to change the API are detailed by the [API change document](https://github.com/kubernetes/kubernetes/tree/{{page.githubbranch}}/docs/devel/api_changes.md).
## OpenAPI and Swagger definitions
Complete API details are documented using [Swagger v1.2](http://swagger.io/) and [OpenAPI](https://www.openapis.org/). The Kubernetes apiserver (aka "master") exposes an API that can be used to retrieve the Swagger v1.2 Kubernetes API spec located at `/swaggerapi`. You can also enable a UI to browse the API documentation at `/swagger-ui` by passing the `--enable-swagger-ui=true` flag to apiserver.
We also host a version of the [latest v1.2 API documentation UI](http://kubernetes.io/kubernetes/third_party/swagger-ui/). This is updated with the latest release, so if you are using a different version of Kubernetes you will want to use the spec from your apiserver.
Starting with Kubernetes 1.4, the OpenAPI spec is also available at `/swagger.json`. While we are transitioning from Swagger v1.2 to OpenAPI (aka Swagger v2.0), some tools such as kubectl and swagger-ui still use the v1.2 spec. The OpenAPI spec is in Beta as of Kubernetes 1.5.
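For example, both specs can be fetched through a local proxy to the apiserver. This is a minimal sketch; the port is arbitrary, and any other means of reaching the apiserver works as well:

```shell
# Proxy the apiserver to localhost, then fetch the API specs.
kubectl proxy --port=8080 &
curl http://localhost:8080/swaggerapi      # Swagger v1.2 spec
curl http://localhost:8080/swagger.json    # OpenAPI (Swagger v2.0) spec, Kubernetes 1.4+
```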
Kubernetes implements an alternative Protobuf-based serialization format for the API that is primarily intended for intra-cluster communication. It is documented in the [design proposal](https://github.com/kubernetes/kubernetes/blob/{{ page.githubbranch }}/docs/proposals/protobuf.md), and the IDL files for each schema are located in the Go packages that define the API objects.
## API versioning
To make it easier to eliminate fields or restructure resource representations, Kubernetes supports
multiple API versions, each at a different API path, such as `/api/v1` or
`/apis/extensions/v1beta1`.
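You can see which API versions your cluster serves with `kubectl api-versions`; the output below is only illustrative and will differ between clusters:

```shell
$ kubectl api-versions
apps/v1beta1
batch/v1
extensions/v1beta1
v1
```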
We chose to version at the API level rather than at the resource or field level to ensure that the API presents a clear, consistent view of system resources and behavior, and to enable controlling access to end-of-lifed and/or experimental APIs. The JSON and Protobuf serialization schemas follow the same guidelines for schema changes - all descriptions below cover both formats.
Note that API versioning and Software versioning are only indirectly related. The [API and release
versioning proposal](https://github.com/kubernetes/kubernetes/blob/{{page.githubbranch}}/docs/design/versioning.md) describes the relationship between API versioning and
software versioning.
Different API versions imply different levels of stability and support. The criteria for each level are described
in more detail in the [API Changes documentation](https://github.com/kubernetes/kubernetes/tree/{{page.githubbranch}}/docs/devel/api_changes.md#alpha-beta-and-stable-versions). They are summarized here:
- Alpha level:
- The version names contain `alpha` (e.g. `v1alpha1`).
- May be buggy. Enabling the feature may expose bugs. Disabled by default.
- Support for feature may be dropped at any time without notice.
- The API may change in incompatible ways in a later software release without notice.
- Recommended for use only in short-lived testing clusters, due to increased risk of bugs and lack of long-term support.
- Beta level:
- The version names contain `beta` (e.g. `v2beta3`).
- Code is well tested. Enabling the feature is considered safe. Enabled by default.
- Support for the overall feature will not be dropped, though details may change.
- The schema and/or semantics of objects may change in incompatible ways in a subsequent beta or stable release. When this happens,
we will provide instructions for migrating to the next version. This may require deleting, editing, and re-creating
API objects. The editing process may require some thought. This may require downtime for applications that rely on the feature.
- Recommended for only non-business-critical uses because of potential for incompatible changes in subsequent releases. If you have
multiple clusters which can be upgraded independently, you may be able to relax this restriction.
- **Please do try our beta features and give feedback on them! Once they exit beta, it may not be practical for us to make more changes.**
- Stable level:
- The version name is `vX` where `X` is an integer.
- Stable versions of features will appear in released software for many subsequent versions.
## API groups
To make it easier to extend the Kubernetes API, we implemented [*API groups*](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/api-group.md).
The API group is specified in a REST path and in the `apiVersion` field of a serialized object.
Currently there are several API groups in use:
1. the "core" (oftentimes called "legacy", due to not having explicit group name) group, which is at
REST path `/api/v1` and is not specified as part of the `apiVersion` field, e.g. `apiVersion: v1`.
1. the named groups are at REST path `/apis/$GROUP_NAME/$VERSION`, and use `apiVersion: $GROUP_NAME/$VERSION`
(e.g. `apiVersion: batch/v1`). Full list of supported API groups can be seen in [Kubernetes API reference](/docs/reference/).
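For example, here is how a named group and version appear together in a manifest's `apiVersion` field. The Job below is a hypothetical sketch using the `batch/v1` group mentioned above:

```yaml
apiVersion: batch/v1          # named group "batch", version "v1"
kind: Job
metadata:
  name: example-job
spec:
  template:
    spec:
      containers:
      - name: example
        image: busybox
        command: ["echo", "hello"]
      restartPolicy: Never
```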
There are two supported paths to extending the API.
1. [Third Party Resources](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/extending-api.md)
are for users with very basic CRUD needs.
1. Coming soon: users needing the full set of Kubernetes API semantics can implement their own apiserver
and use the [aggregator](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/aggregated-api-servers.md)
to make it seamless for clients.
## Enabling API groups
Certain resources and API groups are enabled by default. They can be enabled or disabled by setting `--runtime-config`
on the apiserver. `--runtime-config` accepts comma-separated values. For example: to disable batch/v1, set
`--runtime-config=batch/v1=false`; to enable batch/v2alpha1, set `--runtime-config=batch/v2alpha1`.
The flag accepts a comma-separated set of key=value pairs describing the runtime configuration of the apiserver.
IMPORTANT: Enabling or disabling groups or resources requires restarting apiserver and controller-manager
to pick up the `--runtime-config` changes.
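As a sketch, the flag might appear in a kube-apiserver invocation like the fragment below; all other required apiserver flags are omitted, and the exact invocation depends on how your cluster starts the apiserver:

```shell
# Fragment of a kube-apiserver command line (other required flags omitted):
kube-apiserver --runtime-config=batch/v2alpha1=true,batch/v1=false ...
```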
## Enabling resources in the groups
DaemonSets, Deployments, HorizontalPodAutoscalers, Ingress, Jobs and ReplicaSets are enabled by default.
Other extensions resources can be enabled by setting `--runtime-config` on
apiserver. `--runtime-config` accepts comma-separated values. For example, to disable deployments and jobs, set
`--runtime-config=extensions/v1beta1/deployments=false,extensions/v1beta1/jobs=false`

View File

@ -0,0 +1,117 @@
---
assignees:
- bgrant0607
- mikedanese
title: What is Kubernetes?
---
Kubernetes is an [open-source platform for automating deployment, scaling, and operations of application containers](http://www.slideshare.net/BrianGrant11/wso2con-us-2015-kubernetes-a-platform-for-automating-deployment-scaling-and-operations) across clusters of hosts, providing container-centric infrastructure.
With Kubernetes, you are able to quickly and efficiently respond to customer demand:
- Deploy your applications quickly and predictably.
- Scale your applications on the fly.
- Seamlessly roll out new features.
- Optimize use of your hardware by using only the resources you need.
Our goal is to foster an ecosystem of components and tools that relieve the burden of running applications in public and private clouds.
#### Kubernetes is:
* **portable**: public, private, hybrid, multi-cloud
* **extensible**: modular, pluggable, hookable, composable
* **self-healing**: auto-placement, auto-restart, auto-replication, auto-scaling
The Kubernetes project was started by Google in 2014. Kubernetes builds upon a [decade and a half of experience that Google has with running production workloads at scale](https://research.google.com/pubs/pub43438.html), combined with best-of-breed ideas and practices from the community.
##### Ready to [Get Started](/docs/getting-started-guides/)?
## Why containers?
Looking for reasons why you should be using [containers](http://aucouranton.com/2014/06/13/linux-containers-parallels-lxc-openvz-docker-and-more/)?
![Why Containers?](/images/docs/why_containers.svg)
The *Old Way* to deploy applications was to install the applications on a host using the operating system package manager. This had the disadvantage of entangling the applications' executables, configuration, libraries, and lifecycles with each other and with the host OS. One could build immutable virtual-machine images in order to achieve predictable rollouts and rollbacks, but VMs are heavyweight and non-portable.
The *New Way* is to deploy containers based on operating-system-level virtualization rather than hardware virtualization. These containers are isolated from each other and from the host: they have their own filesystems, they can't see each others' processes, and their computational resource usage can be bounded. They are easier to build than VMs, and because they are decoupled from the underlying infrastructure and from the host filesystem, they are portable across clouds and OS distributions.
Because containers are small and fast, one application can be packed in each container image. This one-to-one application-to-image relationship unlocks the full benefits of containers. With containers, immutable container images can be created at build/release time rather than deployment time, since each application doesn't need to be composed with the rest of the application stack, nor married to the production infrastructure environment. Generating container images at build/release time enables a consistent environment to be carried from development into production.
Similarly, containers are vastly more transparent than VMs, which facilitates monitoring and management. This is especially true when the containers' process lifecycles are managed by the infrastructure rather than hidden by a process supervisor inside the container. Finally, with a single application per container, managing the containers becomes tantamount to managing deployment of the application.
Summary of container benefits:
* **Agile application creation and deployment**:
Increased ease and efficiency of container image creation compared to VM image use.
* **Continuous development, integration, and deployment**:
Provides for reliable and frequent container image build and deployment with quick and easy rollbacks (due to image immutability).
* **Dev and Ops separation of concerns**:
Create application container images at build/release time rather than deployment time, thereby decoupling applications from infrastructure.
* **Environmental consistency across development, testing, and production**:
Runs the same on a laptop as it does in the cloud.
* **Cloud and OS distribution portability**:
Runs on Ubuntu, RHEL, CoreOS, on-prem, Google Container Engine, and anywhere else.
* **Application-centric management**:
Raises the level of abstraction from running an OS on virtual hardware to running an application on an OS using logical resources.
* **Loosely coupled, distributed, elastic, liberated [micro-services](http://martinfowler.com/articles/microservices.html)**:
Applications are broken into smaller, independent pieces and can be deployed and managed dynamically -- not a fat monolithic stack running on one big single-purpose machine.
* **Resource isolation**:
Predictable application performance.
* **Resource utilization**:
High efficiency and density.
#### Why do I need Kubernetes and what can it do?
At a minimum, Kubernetes can schedule and run application containers on clusters of physical or virtual machines. However, Kubernetes also allows developers to 'cut the cord' to physical and virtual machines, moving from a **host-centric** infrastructure to a **container-centric** infrastructure, which provides the full advantages and benefits inherent to containers. Kubernetes provides the infrastructure to build a truly **container-centric** development environment.
Kubernetes satisfies a number of common needs of applications running in production, such as:
* [co-locating helper processes](/docs/user-guide/pods/), facilitating composite applications and preserving the one-application-per-container model,
* [mounting storage systems](/docs/user-guide/volumes/),
* [distributing secrets](/docs/user-guide/secrets/),
* [application health checking](/docs/user-guide/production-pods/#liveness-and-readiness-probes-aka-health-checks),
* [replicating application instances](/docs/user-guide/replication-controller/),
* [horizontal auto-scaling](/docs/user-guide/horizontal-pod-autoscaling/),
* [naming and discovery](/docs/user-guide/connecting-applications/),
* [load balancing](/docs/user-guide/services/),
* [rolling updates](/docs/tasks/run-application/rolling-update-replication-controller/),
* [resource monitoring](/docs/user-guide/monitoring/),
* [log access and ingestion](/docs/user-guide/logging/overview/),
* [support for introspection and debugging](/docs/user-guide/introspection-and-debugging/), and
* [identity and authorization](/docs/admin/authorization/).
This provides the simplicity of Platform as a Service (PaaS) with the flexibility of Infrastructure as a Service (IaaS), and facilitates portability across infrastructure providers.
For more details, see the [user guide](/docs/user-guide/).
#### Why and how is Kubernetes a platform?
Even though Kubernetes provides a lot of functionality, there are always new scenarios that would benefit from new features. Application-specific workflows can be streamlined to accelerate developer velocity. Ad hoc orchestration that is acceptable initially often requires robust automation at scale. This is why Kubernetes was also designed to serve as a platform for building an ecosystem of components and tools to make it easier to deploy, scale, and manage applications.
[Labels](/docs/user-guide/labels/) empower users to organize their resources however they please. [Annotations](/docs/user-guide/annotations/) enable users to decorate resources with custom information to facilitate their workflows and provide an easy way for management tools to checkpoint state.
Additionally, the [Kubernetes control plane](/docs/admin/cluster-components) is built upon the same [APIs](/docs/api/) that are available to developers and users. Users can write their own controllers, [schedulers](https://github.com/kubernetes/kubernetes/tree/{{page.githubbranch}}/docs/devel/scheduler.md), etc., if they choose, with [their own APIs](https://github.com/kubernetes/kubernetes/blob/{{page.githubbranch}}/docs/design/extending-api.md) that can be targeted by a general-purpose [command-line tool](/docs/user-guide/kubectl-overview/).
This [design](https://github.com/kubernetes/kubernetes/blob/{{page.githubbranch}}/docs/design/principles.md) has enabled a number of other systems to build atop Kubernetes.
#### Kubernetes is not:
Kubernetes is not a traditional, all-inclusive PaaS (Platform as a Service) system. We preserve user choice where it is important.
* Kubernetes does not limit the types of applications supported. It does not dictate application frameworks (e.g., [Wildfly](http://wildfly.org/)), restrict the set of supported language runtimes (e.g., Java, Python, Ruby), cater to only [12-factor applications](http://12factor.net/), nor distinguish "apps" from "services". Kubernetes aims to support an extremely diverse variety of workloads, including stateless, stateful, and data-processing workloads. If an application can run in a container, it should run great on Kubernetes.
* Kubernetes does not provide middleware (e.g., message buses), data-processing frameworks (e.g., Spark), databases (e.g., mysql), nor cluster storage systems (e.g., Ceph) as built-in services. Such applications run on Kubernetes.
* Kubernetes does not have a click-to-deploy service marketplace.
* Kubernetes is unopinionated in the source-to-image space. It does not deploy source code and does not build your application. Continuous Integration (CI) workflow is an area where different users and projects have their own requirements and preferences, so we support layering CI workflows on Kubernetes but don't dictate how it should work.
* Kubernetes allows users to choose the logging, monitoring, and alerting systems of their choice. (Though we do provide some integrations as proof of concept.)
* Kubernetes does not provide nor mandate a comprehensive application configuration language/system (e.g., [jsonnet](https://github.com/google/jsonnet)).
* Kubernetes does not provide nor adopt any comprehensive machine configuration, maintenance, management, or self-healing systems.
On the other hand, a number of PaaS systems run *on* Kubernetes, such as [Openshift](https://github.com/openshift/origin), [Deis](http://deis.io/), and [Eldarion](http://eldarion.cloud/). You could also roll your own custom PaaS, integrate with a CI system of your choice, or get along just fine with just Kubernetes: bring your container images and deploy them on Kubernetes.
Since Kubernetes operates at the application level rather than at just the hardware level, it provides some generally applicable features common to PaaS offerings, such as deployment, scaling, load balancing, logging, monitoring, etc. However, Kubernetes is not monolithic, and these default solutions are optional and pluggable.
Additionally, Kubernetes is not a mere "orchestration system"; it eliminates the need for orchestration. The technical definition of "orchestration" is execution of a defined workflow: do A, then B, then C. In contrast, Kubernetes is composed of a set of independent, composable control processes that continuously drive the current state towards the provided desired state. It shouldn't matter how you get from A to C: make it so. Centralized control is also not required; the approach is more akin to "choreography". This results in a system that is easier to use and more powerful, robust, resilient, and extensible.
#### What does *Kubernetes* mean? K8s?
The name **Kubernetes** originates from Greek, meaning "helmsman" or "pilot", and is the root of "governor" and ["cybernetic"](http://www.etymonline.com/index.php?term=cybernetics). **K8s** is an abbreviation derived by replacing the 8 letters "ubernete" with 8.

View File

@ -1,5 +1,9 @@
---
title: Kubernetes Objects
title: Understanding Kubernetes Objects
redirect_from:
- "/docs/concepts/abstractions/overview/"
- "/docs/concepts/abstractions/overview.html"
---
{% capture overview %}
@ -23,6 +27,7 @@ To work with Kubernetes objects--whether to create, modify, or delete them--you'
Every Kubernetes object includes two nested object fields that govern the object's configuration: the object *spec* and the object *status*. The *spec*, which you must provide, describes your *desired state* for the object--the characteristics that you want the object to have. The *status* describes the *actual state* for the object, and is supplied and updated by the Kubernetes system. At any given time, the Kubernetes Control Plane actively manages an object's actual state to match the desired state you supplied.
For example, a Kubernetes Deployment is an object that can represent an application running on your cluster. When you create the Deployment, you might set the Deployment spec to specify that you want three replicas of the application to be running. The Kubernetes system reads the Deployment spec and starts three instances of your desired application--updating the status to match your spec. If any of those instances should fail (a status change), the Kubernetes system responds to the difference between spec and status by making a correction--in this case, starting a replacement instance.
For more information on the object spec, status, and metadata, see the [Kubernetes API Conventions](https://github.com/kubernetes/community/blob/master/contributors/devel/api-conventions.md).
@ -33,7 +38,7 @@ When you create an object in Kubernetes, you must provide the object spec that d
Here's an example `.yaml` file that shows the required fields and object spec for a Kubernetes Deployment:
{% include code.html language="yaml" file="nginx-deployment.yaml" ghlink="/docs/concepts/abstractions/nginx-deployment.yaml" %}
{% include code.html language="yaml" file="nginx-deployment.yaml" ghlink="/docs/concepts/overview/working-with-objects/nginx-deployment.yaml" %}
One way to create a Deployment using a `.yaml` file like the one above is to use the `kubectl create` command in the `kubectl` command-line interface, passing the `.yaml` file as an argument. Here's an example:
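A minimal sketch, assuming the manifest above has been saved locally as `nginx-deployment.yaml`:

```shell
kubectl create -f nginx-deployment.yaml
```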

View File

@ -2,6 +2,9 @@
assignees:
- mikedanese
title: Labels and Selectors
redirect_from:
- "/docs/user-guide/labels/"
- "/docs/user-guide/labels.html"
---
_Labels_ are key/value pairs that are attached to objects, such as pods.
@ -154,7 +157,7 @@ this selector (respectively in `json` or `yaml` format) is equivalent to `compon
#### Resources that support set-based requirements
Newer resources, such as [`Job`](/docs/user-guide/jobs), [`Deployment`](/docs/user-guide/deployments/), [`Replica Set`](/docs/user-guide/replicasets/), and [`Daemon Set`](/docs/admin/daemons/), support _set-based_ requirements as well.
Newer resources, such as [`Job`](/docs/concepts/jobs/run-to-completion-finite-workloads/), [`Deployment`](/docs/user-guide/deployments/), [`Replica Set`](/docs/user-guide/replicasets/), and [`Daemon Set`](/docs/admin/daemons/), support _set-based_ requirements as well.
```yaml
selector:

View File

@ -0,0 +1,16 @@
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80

View File

@ -0,0 +1,240 @@
---
assignees:
- derekwaynecarr
title: Resource Quotas
---
When several users or teams share a cluster with a fixed number of nodes,
there is a concern that one team could use more than its fair share of resources.
Resource quotas are a tool for administrators to address this concern.
A resource quota, defined by a `ResourceQuota` object, provides constraints that limit
aggregate resource consumption per namespace. It can limit the quantity of objects that can
be created in a namespace by type, as well as the total amount of compute resources that may
be consumed by resources in that project.
Resource quotas work like this:
- Different teams work in different namespaces. Currently this is voluntary, but
support for making this mandatory via ACLs is planned.
- The administrator creates one or more Resource Quota objects for each namespace.
- Users create resources (pods, services, etc.) in the namespace, and the quota system
tracks usage to ensure it does not exceed hard resource limits defined in a Resource Quota.
- If creating or updating a resource violates a quota constraint, the request will fail with HTTP
status code `403 FORBIDDEN` with a message explaining the constraint that would have been violated.
- If quota is enabled in a namespace for compute resources like `cpu` and `memory`, users must specify
requests or limits for those values; otherwise, the quota system may reject pod creation. Hint: Use
the LimitRange admission controller to force defaults for pods that make no compute resource requirements.
See the [walkthrough](/docs/admin/resourcequota/walkthrough/) for an example to avoid this problem.
Examples of policies that could be created using namespaces and quotas are:
- In a cluster with a capacity of 32 GiB RAM and 16 cores, let team A use 20 GiB and 10 cores,
  let B use 10 GiB and 4 cores, and hold 2 GiB and 2 cores in reserve for future allocation.
- Limit the "testing" namespace to using 1 core and 1 GiB RAM. Let the "production" namespace
  use any amount.
In the case where the total capacity of the cluster is less than the sum of the quotas of the namespaces,
there may be contention for resources. This is handled on a first-come-first-served basis.
Neither contention nor changes to quota will affect already created resources.
## Enabling Resource Quota
Resource Quota support is enabled by default for many Kubernetes distributions. It is
enabled when the apiserver `--admission-control=` flag has `ResourceQuota` as
one of its arguments.
Resource Quota is enforced in a particular namespace when there is a
`ResourceQuota` object in that namespace. There should be at most one
`ResourceQuota` object in a namespace.
## Compute Resource Quota
You can limit the total sum of [compute resources](/docs/user-guide/compute-resources) that can be requested in a given namespace.
The following resource types are supported:
| Resource Name | Description |
| --------------------- | ----------------------------------------------------------- |
| `cpu` | Across all pods in a non-terminal state, the sum of CPU requests cannot exceed this value. |
| `limits.cpu` | Across all pods in a non-terminal state, the sum of CPU limits cannot exceed this value. |
| `limits.memory` | Across all pods in a non-terminal state, the sum of memory limits cannot exceed this value. |
| `memory` | Across all pods in a non-terminal state, the sum of memory requests cannot exceed this value. |
| `requests.cpu` | Across all pods in a non-terminal state, the sum of CPU requests cannot exceed this value. |
| `requests.memory` | Across all pods in a non-terminal state, the sum of memory requests cannot exceed this value. |
## Storage Resource Quota
You can limit the total sum of [storage resources](/docs/user-guide/persistent-volumes) that can be requested in a given namespace.
In addition, you can limit consumption of storage resources based on associated storage-class.
| Resource Name | Description |
| --------------------- | ----------------------------------------------------------- |
| `requests.storage` | Across all persistent volume claims, the sum of storage requests cannot exceed this value. |
| `persistentvolumeclaims` | The total number of [persistent volume claims](/docs/user-guide/persistent-volumes/#persistentvolumeclaims) that can exist in the namespace. |
| `<storage-class-name>.storageclass.storage.k8s.io/requests.storage` | Across all persistent volume claims associated with the storage-class-name, the sum of storage requests cannot exceed this value. |
| `<storage-class-name>.storageclass.storage.k8s.io/persistentvolumeclaims` | Across all persistent volume claims associated with the storage-class-name, the total number of [persistent volume claims](/docs/user-guide/persistent-volumes/#persistentvolumeclaims) that can exist in the namespace. |
For example, if an operator wants to quota storage with `gold` storage class separate from `bronze` storage class, the operator can
define a quota as follows:
* `gold.storageclass.storage.k8s.io/requests.storage: 500Gi`
* `bronze.storageclass.storage.k8s.io/requests.storage: 100Gi`
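Expressed as a manifest, such a quota might look like the following sketch; the object name is arbitrary and the values match the example above:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: storage-class-quota
spec:
  hard:
    gold.storageclass.storage.k8s.io/requests.storage: 500Gi
    bronze.storageclass.storage.k8s.io/requests.storage: 100Gi
```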
## Object Count Quota
The number of objects of a given type can be restricted. The following types
are supported:
| Resource Name | Description |
| ------------------------------- | ------------------------------------------------- |
| `configmaps` | The total number of config maps that can exist in the namespace. |
| `persistentvolumeclaims` | The total number of [persistent volume claims](/docs/user-guide/persistent-volumes/#persistentvolumeclaims) that can exist in the namespace. |
| `pods` | The total number of pods in a non-terminal state that can exist in the namespace. A pod is in a terminal state if `status.phase in (Failed, Succeeded)` is true. |
| `replicationcontrollers` | The total number of replication controllers that can exist in the namespace. |
| `resourcequotas` | The total number of [resource quotas](/docs/admin/admission-controllers/#resourcequota) that can exist in the namespace. |
| `services` | The total number of services that can exist in the namespace. |
| `services.loadbalancers` | The total number of services of type load balancer that can exist in the namespace. |
| `services.nodeports` | The total number of services of type node port that can exist in the namespace. |
| `secrets` | The total number of secrets that can exist in the namespace. |
For example, `pods` quota counts and enforces a maximum on the number of `pods`
created in a single namespace.
You might want to set a pods quota on a namespace
to avoid the case where a user creates many small pods and exhausts the cluster's
supply of Pod IPs.
## Quota Scopes
Each quota can have an associated set of scopes. A quota will only measure usage for a resource if it matches
the intersection of enumerated scopes.
When a scope is added to the quota, it limits the number of resources it supports to those that pertain to the scope.
Resources specified on the quota outside of the allowed set result in a validation error.
| Scope | Description |
| ----- | ----------- |
| `Terminating` | Match pods where `spec.activeDeadlineSeconds >= 0` |
| `NotTerminating` | Match pods where `spec.activeDeadlineSeconds is nil` |
| `BestEffort` | Match pods that have best effort quality of service. |
| `NotBestEffort` | Match pods that do not have best effort quality of service. |
The `BestEffort` scope restricts a quota to tracking the following resource: `pods`.
The `Terminating`, `NotTerminating`, and `NotBestEffort` scopes restrict a quota to tracking the following resources:
* `cpu`
* `limits.cpu`
* `limits.memory`
* `memory`
* `pods`
* `requests.cpu`
* `requests.memory`
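As a sketch, a scope is declared in `spec.scopes` alongside the quota's `spec.hard` limits. The example below tracks only `BestEffort` pods; the object name and limit value are illustrative:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: best-effort-pods
spec:
  hard:
    pods: "10"
  scopes:
  - BestEffort
```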
## Requests vs Limits
When allocating compute resources, each container may specify a request and a limit value for either CPU or memory.
The quota can be configured to quota either value.
If the quota has a value specified for `requests.cpu` or `requests.memory`, then it requires that every incoming
container makes an explicit request for those resources. If the quota has a value specified for `limits.cpu` or `limits.memory`,
then it requires that every incoming container specifies an explicit limit for those resources.
## Viewing and Setting Quotas
Kubectl supports creating, updating, and viewing quotas:
```shell
$ kubectl create namespace myspace

$ cat <<EOF > compute-resources.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
spec:
  hard:
    pods: "4"
    requests.cpu: "1"
    requests.memory: 1Gi
    limits.cpu: "2"
    limits.memory: 2Gi
EOF
$ kubectl create -f ./compute-resources.yaml --namespace=myspace

$ cat <<EOF > object-counts.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: object-counts
spec:
  hard:
    configmaps: "10"
    persistentvolumeclaims: "4"
    replicationcontrollers: "20"
    secrets: "10"
    services: "10"
    services.loadbalancers: "2"
EOF
$ kubectl create -f ./object-counts.yaml --namespace=myspace

$ kubectl get quota --namespace=myspace
NAME                    AGE
compute-resources       30s
object-counts           32s

$ kubectl describe quota compute-resources --namespace=myspace
Name:                    compute-resources
Namespace:               myspace
Resource                 Used  Hard
--------                 ----  ----
limits.cpu               0     2
limits.memory            0     2Gi
pods                     0     4
requests.cpu             0     1
requests.memory          0     1Gi

$ kubectl describe quota object-counts --namespace=myspace
Name:                           object-counts
Namespace:                      myspace
Resource                        Used  Hard
--------                        ----  ----
configmaps                      0     10
persistentvolumeclaims          0     4
replicationcontrollers          0     20
secrets                         1     10
services                        0     10
services.loadbalancers          0     2
```
## Quota and Cluster Capacity
Resource Quota objects are independent of the Cluster Capacity. They are
expressed in absolute units. So, if you add nodes to your cluster, this does *not*
automatically give each namespace the ability to consume more resources.
Sometimes more complex policies may be desired, such as:
- proportionally divide total cluster resources among several teams.
- allow each tenant to grow resource usage as needed, but have a generous
limit to prevent accidental resource exhaustion.
- detect demand from one namespace, add nodes, and increase quota.
Such policies could be implemented using ResourceQuota as a building-block, by
writing a 'controller' which watches the quota usage and adjusts the quota
hard limits of each namespace according to other signals.
Note that resource quota divides up aggregate cluster resources, but it creates no
restrictions around nodes: pods from several namespaces may run on the same node.
## Example
See a [detailed example for how to use resource quota](/docs/admin/resourcequota/walkthrough/).
## Read More
See [ResourceQuota design doc](https://github.com/kubernetes/kubernetes/blob/{{page.githubbranch}}/docs/design/admission_control_resource_quota.md) for more information.

View File

@ -0,0 +1,389 @@
---
assignees:
- davidopp
- thockin
title: DNS Pods and Services
---
## Introduction
As of Kubernetes 1.3, DNS is a built-in service launched automatically using the addon manager [cluster add-on](http://releases.k8s.io/{{page.githubbranch}}/cluster/addons/README.md).
Kubernetes DNS schedules a DNS Pod and Service on the cluster, and configures
the kubelets to tell individual containers to use the DNS Service's IP to
resolve DNS names.
## What things get DNS names?
Every Service defined in the cluster (including the DNS server itself) is
assigned a DNS name. By default, a client Pod's DNS search list will
include the Pod's own namespace and the cluster's default domain. This is best
illustrated by example:
Assume a Service named `foo` in the Kubernetes namespace `bar`. A Pod running
in namespace `bar` can look up this service by simply doing a DNS query for
`foo`. A Pod running in namespace `quux` can look up this service by doing a
DNS query for `foo.bar`.
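A quick sketch of those two lookups from inside a Pod, using the hypothetical names above (the container image must provide `nslookup`):

```shell
# From a Pod in namespace "bar":
nslookup foo
# From a Pod in namespace "quux":
nslookup foo.bar
```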
## Supported DNS schema
The following sections detail the record types and layout that are supported. Any other layout, names, or
queries that happen to work are considered implementation details and are subject to change without warning.
### Services
#### A records
"Normal" (not headless) Services are assigned a DNS A record for a name of the
form `my-svc.my-namespace.svc.cluster.local`. This resolves to the cluster IP
of the Service.
"Headless" (without a cluster IP) Services are also assigned a DNS A record for
a name of the form `my-svc.my-namespace.svc.cluster.local`. Unlike normal
Services, this resolves to the set of IPs of the pods selected by the Service.
Clients are expected to consume the set or else use standard round-robin
selection from the set.
### SRV records
SRV Records are created for named ports that are part of normal or [Headless
Services](/docs/user-guide/services/#headless-services).
For each named port, the SRV record would have the form
`_my-port-name._my-port-protocol.my-svc.my-namespace.svc.cluster.local`.
For a regular service, this resolves to the port number and the CNAME:
`my-svc.my-namespace.svc.cluster.local`.
For a headless service, this resolves to multiple answers, one for each pod
that is backing the service, and contains the port number and a CNAME of the pod
of the form `auto-generated-name.my-svc.my-namespace.svc.cluster.local`.
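For example, assuming a named port `http` served over TCP, the SRV record could be queried as in the sketch below; use any resolver tool available in your container that supports SRV queries:

```shell
nslookup -type=SRV _http._tcp.my-svc.my-namespace.svc.cluster.local
```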
### Backwards compatibility
Previous versions of kube-dns made names of the form
`my-svc.my-namespace.cluster.local` (the 'svc' level was added later). This
is no longer supported.
### Pods
#### A Records
When enabled, pods are assigned a DNS A record in the form of `pod-ip-address.my-namespace.pod.cluster.local`.
For example, a pod with IP `1.2.3.4` in the namespace `default` with a DNS name of `cluster.local` would have an entry: `1-2-3-4.default.pod.cluster.local`.
#### A Records and hostname based on Pod's hostname and subdomain fields
Currently when a pod is created, its hostname is the Pod's `metadata.name` value.
With v1.2, users can specify a Pod annotation, `pod.beta.kubernetes.io/hostname`, to specify what the Pod's hostname should be.
The Pod annotation, if specified, takes precedence over the Pod's name as the hostname of the pod.
For example, given a Pod with annotation `pod.beta.kubernetes.io/hostname: my-pod-name`, the Pod will have its hostname set to "my-pod-name".
With v1.3, the PodSpec has a `hostname` field, which can be used to specify the Pod's hostname. This field value takes precedence over the
`pod.beta.kubernetes.io/hostname` annotation value.
v1.2 introduces a beta feature where the user can specify a Pod annotation, `pod.beta.kubernetes.io/subdomain`, to specify the Pod's subdomain.
The final domain will be "<hostname>.<subdomain>.<pod namespace>.svc.<cluster domain>".
For example, a Pod with the hostname annotation set to "foo", and the subdomain annotation set to "bar", in namespace "my-namespace", will have the FQDN "foo.bar.my-namespace.svc.cluster.local".
With v1.3, the PodSpec has a `subdomain` field, which can be used to specify the Pod's subdomain. This field value takes precedence over the
`pod.beta.kubernetes.io/subdomain` annotation value.
Example:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: default-subdomain
spec:
  selector:
    name: busybox
  clusterIP: None
  ports:
  - name: foo # Actually, no port is needed.
    port: 1234
    targetPort: 1234
---
apiVersion: v1
kind: Pod
metadata:
  name: busybox1
  labels:
    name: busybox
spec:
  hostname: busybox-1
  subdomain: default-subdomain
  containers:
  - image: busybox
    command:
      - sleep
      - "3600"
    name: busybox
---
apiVersion: v1
kind: Pod
metadata:
  name: busybox2
  labels:
    name: busybox
spec:
  hostname: busybox-2
  subdomain: default-subdomain
  containers:
  - image: busybox
    command:
      - sleep
      - "3600"
    name: busybox
```
If there exists a headless service in the same namespace as the pod and with the same name as the subdomain, the cluster's KubeDNS Server also returns an A record for the Pod's fully qualified hostname.
Given a Pod with the hostname set to "busybox-1" and the subdomain set to "default-subdomain", and a headless Service named "default-subdomain" in the same namespace, the pod will see its own FQDN as "busybox-1.default-subdomain.my-namespace.svc.cluster.local". DNS serves an A record at that name, pointing to the Pod's IP. Both pods "busybox1" and "busybox2" can have their distinct A records.
As of Kubernetes v1.2, the Endpoints object also has the annotation `endpoints.beta.kubernetes.io/hostnames-map`. Its value is the json representation of map[string(IP)][endpoints.HostRecord], for example: '{"10.245.1.6":{HostName: "my-webserver"}}'.
If the Endpoints are for a headless service, an A record is created with the format <hostname>.<service name>.<pod namespace>.svc.<cluster domain>
For the example json, if endpoints are for a headless service named "bar", and one of the endpoints has IP "10.245.1.6", an A record is created with the name "my-webserver.bar.my-namespace.svc.cluster.local" and the A record lookup would return "10.245.1.6".
This endpoints annotation generally does not need to be specified by end-users, but can be used by the internal service controller to deliver the aforementioned feature.
With v1.3, The Endpoints object can specify the `hostname` for any endpoint, along with its IP. The hostname field takes precedence over the hostname value
that might have been specified via the `endpoints.beta.kubernetes.io/hostnames-map` annotation.
With v1.3, the following annotations are deprecated: `pod.beta.kubernetes.io/hostname`, `pod.beta.kubernetes.io/subdomain`, `endpoints.beta.kubernetes.io/hostnames-map`
## How do I test if it is working?
### Create a simple Pod to use as a test environment
Create a file named busybox.yaml with the
following contents:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: busybox
  namespace: default
spec:
  containers:
  - image: busybox
    command:
      - sleep
      - "3600"
    imagePullPolicy: IfNotPresent
    name: busybox
  restartPolicy: Always
```
Then create a pod using this file:
```
kubectl create -f busybox.yaml
```
### Wait for this pod to go into the running state
You can get its status with:
```
kubectl get pods busybox
```
You should see:
```
NAME READY STATUS RESTARTS AGE
busybox 1/1 Running 0 <some-time>
```
### Validate that DNS is working
Once that pod is running, you can exec nslookup in that environment:
```
kubectl exec -ti busybox -- nslookup kubernetes.default
```
You should see something like:
```
Server: 10.0.0.10
Address 1: 10.0.0.10
Name: kubernetes.default
Address 1: 10.0.0.1
```
If you see that, DNS is working correctly.
### Troubleshooting Tips
If the nslookup command fails, check the following:
#### Check the local DNS configuration first
Take a look inside the resolv.conf file. (See "Inheriting DNS from the node" and "Known issues" below for more information)
```
kubectl exec busybox cat /etc/resolv.conf
```
Verify that the search path and name server are set up like the following (note that search path may vary for different cloud providers):
```
search default.svc.cluster.local svc.cluster.local cluster.local google.internal c.gce_project_id.internal
nameserver 10.0.0.10
options ndots:5
```
#### Quick diagnosis
Errors such as the following indicate a problem with the kube-dns add-on or associated Services:
```
$ kubectl exec -ti busybox -- nslookup kubernetes.default
Server: 10.0.0.10
Address 1: 10.0.0.10
nslookup: can't resolve 'kubernetes.default'
```
or
```
$ kubectl exec -ti busybox -- nslookup kubernetes.default
Server: 10.0.0.10
Address 1: 10.0.0.10 kube-dns.kube-system.svc.cluster.local
nslookup: can't resolve 'kubernetes.default'
```
#### Check if the DNS pod is running
Use the kubectl get pods command to verify that the DNS pod is running.
```
kubectl get pods --namespace=kube-system -l k8s-app=kube-dns
```
You should see something like:
```
NAME READY STATUS RESTARTS AGE
...
kube-dns-v19-ezo1y 3/3 Running 0 1h
...
```
If you see that no pod is running or that the pod has failed/completed, the DNS add-on may not be deployed by default in your current environment and you will have to deploy it manually.
#### Check for Errors in the DNS pod
Use `kubectl logs` command to see logs for the DNS daemons.
```
kubectl logs --namespace=kube-system $(kubectl get pods --namespace=kube-system -l k8s-app=kube-dns -o name) -c kubedns
kubectl logs --namespace=kube-system $(kubectl get pods --namespace=kube-system -l k8s-app=kube-dns -o name) -c dnsmasq
kubectl logs --namespace=kube-system $(kubectl get pods --namespace=kube-system -l k8s-app=kube-dns -o name) -c healthz
```
See if there are any suspicious log entries. The letters W, E, and F at the beginning of a line represent Warning, Error, and Failure, respectively. Please search for entries with these logging levels and use [kubernetes issues](https://github.com/kubernetes/kubernetes/issues) to report unexpected errors.
#### Is DNS service up?
Verify that the DNS service is up by using the `kubectl get service` command.
```
kubectl get svc --namespace=kube-system
```
You should see:
```
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
...
kube-dns 10.0.0.10 <none> 53/UDP,53/TCP 1h
...
```
If you have created the service, or if it should have been created by default but does not appear, see the [debugging services page](http://kubernetes.io/docs/user-guide/debugging-services/) for more information.
#### Are DNS endpoints exposed?
You can verify that DNS endpoints are exposed by using the `kubectl get endpoints` command.
```
kubectl get ep kube-dns --namespace=kube-system
```
You should see something like:
```
NAME ENDPOINTS AGE
kube-dns 10.180.3.17:53,10.180.3.17:53 1h
```
If you do not see the endpoints, see endpoints section in the [debugging services documentation](http://kubernetes.io/docs/user-guide/debugging-services/).
For additional Kubernetes DNS examples, see the [cluster-dns examples](https://github.com/kubernetes/kubernetes/tree/master/examples/cluster-dns) in the Kubernetes GitHub repository.
## Kubernetes Federation (Multiple Zone support)
Release 1.3 introduced Cluster Federation support for multi-site
Kubernetes installations. This required some minor
(backward-compatible) changes to the way
the Kubernetes cluster DNS server processes DNS queries, to facilitate
the lookup of federated services (which span multiple Kubernetes clusters).
See the [Cluster Federation Administrators' Guide](/docs/admin/federation) for more
details on Cluster Federation and multi-site support.
## How it Works
The running Kubernetes DNS pod holds 3 containers - kubedns, dnsmasq and a health check called healthz.
The kubedns process watches the Kubernetes master for changes in Services and Endpoints, and maintains
in-memory lookup structures to service DNS requests. The dnsmasq container adds DNS caching to improve
performance. The healthz container provides a single health check endpoint while performing dual healthchecks
(for dnsmasq and kubedns).
The DNS pod is exposed as a Kubernetes Service with a static IP. Once assigned, the kubelet passes the DNS
Service's IP to each container with the `--cluster-dns=10.0.0.10` flag.
DNS names also need domains. The local domain is configurable in the kubelet using
the `--cluster-domain=<default local domain>` flag.
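Put together, a kubelet configured for cluster DNS might include flags like the following fragment; the values match the defaults used elsewhere on this page, and all other kubelet flags are omitted:

```shell
kubelet --cluster-dns=10.0.0.10 --cluster-domain=cluster.local ...
```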
The Kubernetes cluster DNS server (based off the [SkyDNS](https://github.com/skynetservices/skydns) library)
supports forward lookups (A records), service lookups (SRV records) and reverse IP address lookups (PTR records).
## Inheriting DNS from the node
When running a pod, kubelet will prepend the cluster DNS server and search
paths to the node's own DNS settings. If the node is able to resolve DNS names
specific to the larger environment, pods should be able to, also. See "Known
issues" below for a caveat.
If you don't want this, or if you want a different DNS config for pods, you can
use the kubelet's `--resolv-conf` flag. Setting it to "" means that pods will
not inherit DNS. Setting it to a valid file path means that kubelet will use
this file instead of `/etc/resolv.conf` for DNS inheritance.
## Known issues
Kubernetes installs do not configure the nodes' resolv.conf files to use the
cluster DNS by default, because that process is inherently distro-specific.
This should probably be implemented eventually.
Linux's libc is impossibly stuck ([see this bug from
2005](https://bugzilla.redhat.com/show_bug.cgi?id=168253)) with limits of just
3 DNS `nameserver` records and 6 DNS `search` records. Kubernetes needs to
consume 1 `nameserver` record and 3 `search` records. This means that if a
local installation already uses 3 `nameserver`s or uses more than 3 `search`es,
some of those settings will be lost. As a partial workaround, the node can run
`dnsmasq` which will provide more `nameserver` entries, but not more `search`
entries. You can also use kubelet's `--resolv-conf` flag.
If you are using Alpine version 3.3 or earlier as your base image, DNS may not
work properly owing to a known issue with Alpine. Check [here](https://github.com/kubernetes/kubernetes/issues/30215)
for more information.
## References
- [Docs for the DNS cluster addon](http://releases.k8s.io/{{page.githubbranch}}/cluster/addons/dns/README.md)
## What's next
- [Autoscaling the DNS Service in a Cluster](/docs/tasks/administer-cluster/dns-horizontal-autoscaling/).

View File

@ -2,6 +2,9 @@
assignees:
- erictune
title: Init Containers
redirect_from:
- "/docs/concepts/abstractions/init-containers/"
- "/docs/concepts/abstractions/init-containers.html"
---
{% capture overview %}

View File

@ -1,5 +1,8 @@
---
title: Pods
title: Pod Overview
redirect_from:
- "/docs/concepts/abstractions/pod/"
- "/docs/concepts/abstractions/pod.html"
---
{% capture overview %}

View File

@ -0,0 +1,76 @@
---
title: Limiting Storage Consumption
---
This example demonstrates an easy way to limit the amount of storage consumed in a namespace.
The following resources are used in the demonstration:
* [Resource Quota](/docs/admin/resourcequota/)
* [Limit Range](/docs/admin/limitrange/)
* [Persistent Volume Claim](/docs/user-guide/persistent-volumes/)
This example assumes you have a functional Kubernetes setup.
## Limiting Storage Consumption
The cluster-admin is operating a cluster on behalf of a user population and the admin wants to control
how much storage a single namespace can consume in order to control cost.
The admin would like to limit:
1. The number of persistent volume claims in a namespace
2. The amount of storage each claim can request
3. The amount of cumulative storage the namespace can have
## LimitRange to limit requests for storage
Adding a `LimitRange` to a namespace enforces storage request sizes to a minimum and maximum. Storage is requested
via `PersistentVolumeClaim`. The admission controller that enforces limit ranges will reject any PVC that is above or below
the values set by the admin.
In this example, a PVC requesting 10Gi of storage would be rejected because it exceeds the 2Gi max.
```
apiVersion: v1
kind: LimitRange
metadata:
  name: storagelimits
spec:
  limits:
  - type: PersistentVolumeClaim
    max:
      storage: 2Gi
    min:
      storage: 1Gi
```
Minimum storage requests are used when the underlying storage provider requires certain minimums. For example,
AWS EBS volumes have a 1Gi minimum requirement.
## StorageQuota to limit PVC count and cumulative storage capacity
Admins can limit the number of PVCs in a namespace as well as the cumulative capacity of those PVCs. New PVCs that exceed
either maximum value will be rejected.
In this example, a 6th PVC in the namespace would be rejected because it exceeds the maximum count of 5. Similarly,
combined with the 2Gi maximum from the limit range above, a 5Gi quota cannot be satisfied by 3 PVCs of 2Gi each: that
would be 6Gi requested for a namespace capped at 5Gi.
```
apiVersion: v1
kind: ResourceQuota
metadata:
  name: storagequota
spec:
  hard:
    persistentvolumeclaims: "5"
    requests.storage: "5Gi"
```
## Summary
A limit range can put a ceiling on how much storage is requested, while a resource quota can effectively cap the storage
consumed by a namespace through claim counts and cumulative storage capacity. This allows a cluster-admin to plan their
cluster's storage budget without risk of any one project going over its allotment.

View File

@ -0,0 +1,87 @@
---
title: Federated ConfigMap
---
This guide explains how to use ConfigMaps in a Federation control plane.
* TOC
{:toc}
## Prerequisites
This guide assumes that you have a running Kubernetes Cluster
Federation installation. If not, then head over to the
[federation admin guide](/docs/admin/federation/) to learn how to
bring up a cluster federation (or have your cluster administrator do
this for you).
Other tutorials, such as Kelsey Hightower's
[Federated Kubernetes Tutorial](https://github.com/kelseyhightower/kubernetes-cluster-federation),
might also help you create a Federated Kubernetes cluster.
You should also have a basic
[working knowledge of Kubernetes](/docs/getting-started-guides/) in
general and [ConfigMaps](/docs/user-guide/configmap/) in particular.
## Overview
Federated ConfigMaps are very similar to the traditional [Kubernetes
ConfigMaps](/docs/user-guide/configmap/) and provide the same functionality.
Creating them in the federation control plane ensures that they are synchronized
across all the clusters in federation.
## Creating a Federated ConfigMap
The API for Federated ConfigMap is 100% compatible with the
API for traditional Kubernetes ConfigMap. You can create a ConfigMap by sending
a request to the federation apiserver.
You can do that using [kubectl](/docs/user-guide/kubectl/) by running:
``` shell
kubectl --context=federation-cluster create -f myconfigmap.yaml
```
The `--context=federation-cluster` flag tells kubectl to submit the
request to the Federation apiserver instead of sending it to a Kubernetes
cluster.
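The file itself is an ordinary ConfigMap manifest; a minimal, hypothetical `myconfigmap.yaml` might look like this:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: myconfigmap
data:
  example.key: example-value
```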
Once a Federated ConfigMap is created, the federation control plane will create
a matching ConfigMap in all underlying Kubernetes clusters.
You can verify this by checking each of the underlying clusters, for example:
``` shell
kubectl --context=gce-asia-east1a get configmap myconfigmap
```
The above assumes that you have a context named 'gce-asia-east1a'
configured in your client for your cluster in that zone.
These ConfigMaps in underlying clusters will match the Federated ConfigMap.
## Updating a Federated ConfigMap
You can update a Federated ConfigMap as you would update a Kubernetes
ConfigMap; however, for a Federated ConfigMap, you must send the request to
the federation apiserver instead of sending it to a specific Kubernetes cluster.
The federation control plane ensures that whenever the Federated ConfigMap is
updated, it updates the corresponding ConfigMaps in all underlying clusters to
match it.
## Deleting a Federated ConfigMap
You can delete a Federated ConfigMap as you would delete a Kubernetes
ConfigMap; however, for a Federated ConfigMap, you must send the request to
the federation apiserver instead of sending it to a specific Kubernetes cluster.
For example, you can do that using kubectl by running:
```shell
kubectl --context=federation-cluster delete configmap myconfigmap
```
Note that at this point, deleting a Federated ConfigMap will not delete the
corresponding ConfigMaps from underlying clusters.
You must delete the underlying ConfigMaps manually.
We intend to fix this in the future.

View File

@ -0,0 +1,83 @@
---
title: Federated DaemonSet
---
This guide explains how to use DaemonSets in a federation control plane.
* TOC
{:toc}
## Prerequisites
This guide assumes that you have a running Kubernetes Cluster
Federation installation. If not, then head over to the
[federation admin guide](/docs/admin/federation/) to learn how to
bring up a cluster federation (or have your cluster administrator do
this for you).
Other tutorials, such as Kelsey Hightower's
[Federated Kubernetes Tutorial](https://github.com/kelseyhightower/kubernetes-cluster-federation),
might also help you create a Federated Kubernetes cluster.
You should also have a basic
[working knowledge of Kubernetes](/docs/getting-started-guides/) in
general and DaemonSets in particular.
## Overview
DaemonSets in the federation control plane ("Federated Daemonsets" in
this guide) are very similar to the traditional Kubernetes
DaemonSets and provide the same functionality.
Creating them in the federation control plane ensures that they are synchronized
across all the clusters in federation.
## Creating a Federated Daemonset
The API for Federated Daemonset is 100% compatible with the
API for traditional Kubernetes DaemonSet. You can create a DaemonSet by sending
a request to the federation apiserver.
You can do that using [kubectl](/docs/user-guide/kubectl/) by running:
``` shell
kubectl --context=federation-cluster create -f mydaemonset.yaml
```
The `--context=federation-cluster` flag tells kubectl to submit the
request to the Federation apiserver instead of sending it to a Kubernetes
cluster.
Once a Federated Daemonset is created, the federation control plane will create
a matching DaemonSet in all underlying Kubernetes clusters.
You can verify this by checking each of the underlying clusters, for example:
``` shell
kubectl --context=gce-asia-east1a get daemonset mydaemonset
```
The above assumes that you have a context named 'gce-asia-east1a'
configured in your client for your cluster in that zone.
These DaemonSets in underlying clusters will match the Federated Daemonset.
## Updating a Federated Daemonset
You can update a Federated Daemonset as you would update a Kubernetes
DaemonSet; however, for a Federated Daemonset, you must send the request to
the federation apiserver instead of sending it to a specific Kubernetes cluster.
The federation control plane ensures that whenever the Federated Daemonset is
updated, it updates the corresponding DaemonSets in all underlying clusters to
match it.
## Deleting a Federated Daemonset
You can delete a Federated Daemonset as you would delete a Kubernetes
DaemonSet; however, for a Federated Daemonset, you must send the request to
the federation apiserver instead of sending it to a specific Kubernetes cluster.
For example, you can do that using kubectl by running:
```shell
kubectl --context=federation-cluster delete daemonset mydaemonset
```

View File

@ -0,0 +1,108 @@
---
title: Federated Deployment
---
This guide explains how to use Deployments in the Federation control plane.
* TOC
{:toc}
## Prerequisites
This guide assumes that you have a running Kubernetes Cluster
Federation installation. If not, then head over to the
[federation admin guide](/docs/admin/federation/) to learn how to
bring up a cluster federation (or have your cluster administrator do
this for you).
Other tutorials, such as Kelsey Hightower's
[Federated Kubernetes Tutorial](https://github.com/kelseyhightower/kubernetes-cluster-federation),
might also help you create a Federated Kubernetes cluster.
You should also have a basic
[working knowledge of Kubernetes](/docs/getting-started-guides/) in
general and [Deployment](/docs/user-guide/deployments) in particular.
## Overview
Deployments in the federation control plane (referred to as "Federated Deployments" in
this guide) are very similar to the traditional [Kubernetes
Deployment](/docs/user-guide/deployments/), and provide the same functionality.
Creating them in the federation control plane ensures that the desired number of
replicas exist across the registered clusters.
**As of Kubernetes version 1.5, Federated Deployment is an Alpha feature. The core
functionality of Deployment is present, but some features
(such as full rollout compatibility) are still in development.**
## Creating a Federated Deployment
The API for Federated Deployment is compatible with the
API for traditional Kubernetes Deployment. You can create a Deployment by sending
a request to the federation apiserver.
You can do that using [kubectl](/docs/user-guide/kubectl/) by running:
``` shell
kubectl --context=federation-cluster create -f mydeployment.yaml
```
The '--context=federation-cluster' flag tells kubectl to submit the
request to the Federation apiserver instead of sending it to a Kubernetes
cluster.
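`mydeployment.yaml` is an ordinary Deployment manifest. A minimal sketch, matching the `mydep` name used below (the nginx image and label are illustrative):
```yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: mydep
spec:
  replicas: 9
  template:
    metadata:
      labels:
        app: nginx                 # illustrative label
    spec:
      containers:
      - name: nginx
        image: nginx               # illustrative image
        ports:
        - containerPort: 80
```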
Once a Federated Deployment is created, the federation control plane will create
a Deployment in all underlying Kubernetes clusters.
You can verify this by checking each of the underlying clusters, for example:
``` shell
kubectl --context=gce-asia-east1a get deployment mydep
```
The above assumes that you have a context named 'gce-asia-east1a'
configured in your client for your cluster in that zone.
These Deployments in underlying clusters will match the federation Deployment
_except_ in the number of replicas and revision-related annotations.
The federation control plane ensures that the sum of the replicas
across all underlying clusters matches the desired number of replicas in the
Federated Deployment.
### Spreading Replicas in Underlying Clusters
By default, replicas are spread equally across all the underlying clusters. For example,
if you have 3 registered clusters and you create a Federated Deployment with
`spec.replicas = 9`, then each Deployment in the 3 clusters will have
`spec.replicas=3`.
To modify the number of replicas in each cluster, you can specify
[FederatedReplicaSetPreference](https://github.com/kubernetes/kubernetes/blob/{{page.githubbranch}}/federation/apis/federation/types.go)
as an annotation with key `federation.kubernetes.io/deployment-preferences`
on Federated Deployment.
## Updating a Federated Deployment
You can update a Federated Deployment as you would update a Kubernetes
Deployment; however, for a Federated Deployment, you must send the request to
the federation apiserver instead of sending it to a specific Kubernetes cluster.
The federation control plane ensures that whenever the Federated Deployment is
updated, it updates the corresponding Deployments in all underlying clusters to
match it. If the rolling update strategy is chosen, each underlying
cluster performs the rolling update independently, and `maxSurge` and `maxUnavailable`
apply only to individual clusters. This behavior may change in the future.
If your update includes a change in number of replicas, the federation
control plane will change the number of replicas in underlying clusters to
ensure that their sum remains equal to the number of desired replicas in
Federated Deployment.
## Deleting a Federated Deployment
You can delete a Federated Deployment as you would delete a Kubernetes
Deployment; however, for a Federated Deployment, you must send the request to
the federation apiserver instead of sending it to a specific Kubernetes cluster.
For example, you can do that using kubectl by running:
```shell
kubectl --context=federation-cluster delete deployment mydep
```

View File

@ -0,0 +1,40 @@
---
title: Federated Events
---
This guide explains how to use events in federation control plane to help in debugging.
* TOC
{:toc}
## Prerequisites
This guide assumes that you have a running Kubernetes Cluster
Federation installation. If not, then head over to the
[federation admin guide](/docs/admin/federation/) to learn how to
bring up a cluster federation (or have your cluster administrator do
this for you). Other tutorials, for example
[this one](https://github.com/kelseyhightower/kubernetes-cluster-federation)
by Kelsey Hightower, are also available to help you.
You are also expected to have a basic
[working knowledge of Kubernetes](/docs/getting-started-guides/) in
general.
## Overview
Events in the federation control plane (referred to as "federation events" in
this guide) are very similar to traditional Kubernetes
Events and provide the same functionality.
Federation events are stored only in the federation control plane and are not passed on to the underlying Kubernetes clusters.
Federation controllers create events as they process API resources, to surface
the current state of those resources to the user.
You can get all events from federation apiserver by running:
```shell
kubectl --context=federation-cluster get events
```
The standard kubectl get, update, delete commands will all work.

View File

@ -0,0 +1,355 @@
---
title: Federated Ingress
---
This guide explains how to use Kubernetes Federated Ingress to deploy
a common HTTP(S) virtual IP load balancer across a federated service running in
multiple Kubernetes clusters. As of v1.4, clusters hosted in Google
Cloud (both GKE and GCE, or both) are supported. This makes it
easy to deploy a service that reliably serves HTTP(S) traffic
originating from web clients around the globe on a single, static IP
address. Low
network latency, high fault tolerance and easy administration are
ensured through intelligent request routing and automatic replica
relocation (using [Federated ReplicaSets](/docs/tasks/administer-federation/replicaset/)).
Clients are automatically routed, via the shortest network path, to
the cluster closest to them with available capacity (despite the fact
that all clients use exactly the same static IP address). The load balancer
automatically checks the health of the pods comprising the service,
and avoids sending requests to unresponsive or slow pods (or entire
unresponsive clusters).
Federated Ingress is released as an alpha feature, and supports Google Cloud Platform (GKE,
GCE and hybrid scenarios involving both) in Kubernetes v1.4. Work is under way to support other cloud
providers such as AWS, and other hybrid cloud scenarios (e.g. services
spanning private on-premise as well as public cloud Kubernetes
clusters). We welcome your feedback.
* TOC
{:toc}
## Prerequisites
This guide assumes that you have a running Kubernetes Cluster
Federation installation. If not, then head over to the
[federation admin guide](/docs/admin/federation/) to learn how to
bring up a cluster federation (or have your cluster administrator do
this for you). Other tutorials, for example
[this one](https://github.com/kelseyhightower/kubernetes-cluster-federation)
by Kelsey Hightower, are also available to help you.
You are also expected to have a basic
[working knowledge of Kubernetes](/docs/getting-started-guides/) in
general, and [Ingress](/docs/user-guide/ingress/) in particular.
## Overview
Federated Ingresses are created in much the same way as traditional
[Kubernetes Ingresses](/docs/user-guide/ingress/): by making an API
call which specifies the desired properties of your logical ingress point. In the
case of Federated Ingress, this API call is directed to the
Federation API endpoint, rather than a Kubernetes cluster API
endpoint. The API for Federated Ingress is 100% compatible with the
API for traditional Kubernetes Ingresses.
Once created, the Federated Ingress automatically:
1. creates matching Kubernetes Ingress objects in every cluster
underlying your Cluster Federation,
2. ensures that all of these in-cluster ingress objects share the same
logical global L7 (i.e. HTTP(S)) load balancer and IP address,
3. monitors the health and capacity of the service "shards" (i.e. your
pods) behind this ingress in each cluster, and
4. ensures that all client connections are routed to an appropriate
healthy backend service endpoint at all times, even in the event of
pod, cluster, availability zone or regional outages.
Note that in the case of Google Cloud, the logical L7 load balancer is
not a single physical device (which would present both a single point
of failure, and a single global network routing choke point), but
rather a
[truly global, highly available load balancing managed service](https://cloud.google.com/load-balancing/),
globally reachable via a single, static IP address.
Clients inside your federated Kubernetes clusters (i.e. Pods) will be
automatically routed to the cluster-local shard of the Federated Service
backing the Ingress in their
cluster if it exists and is healthy, or the closest healthy shard in a
different cluster if it does not. Note that this involves a network
trip to the HTTP(s) load balancer, which resides outside your local
Kubernetes cluster but inside the same GCP region.
## Creating a federated ingress
You can create a federated ingress in any of the usual ways, for example using kubectl:
``` shell
kubectl --context=federation-cluster create -f myingress.yaml
```
For example ingress YAML configurations, see the [Ingress User Guide](/docs/user-guide/ingress/).
The '--context=federation-cluster' flag tells kubectl to submit the
request to the Federation API endpoint, with the appropriate
credentials. If you have not yet configured such a context, visit the
[federation admin guide](/docs/admin/federation/) or one of the
[administration tutorials](https://github.com/kelseyhightower/kubernetes-cluster-federation)
to find out how to do so.
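As a point of reference, `myingress.yaml` is an ordinary Ingress manifest. A minimal sketch might be the following; the backend service name and port are illustrative and do not necessarily match the output shown further below:
```yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: myingress
spec:
  backend:
    serviceName: nginx     # a federated Service that you create separately (see below)
    servicePort: 80
```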
As described above, the Federated Ingress will automatically create
and maintain matching Kubernetes ingresses in all of the clusters
underlying your federation. These cluster-specific ingresses (and
their associated ingress controllers) configure and manage the load
balancing and health checking infrastructure that ensures that traffic
is load balanced to each cluster appropriately.
You can verify this by checking in each of the underlying clusters, for example:
``` shell
kubectl --context=gce-asia-east1a get ingress myingress
NAME HOSTS ADDRESS PORTS AGE
myingress * 130.211.5.194 80, 443 1m
```
The above assumes that you have a context named 'gce-asia-east1a'
configured in your client for your cluster in that zone. The name and
namespace of the underlying ingress will automatically match those of
the Federated Ingress that you created above (and if you happen to
have had ingresses of the same name and namespace already existing in
any of those clusters, they will be automatically adopted by the
Federation and updated to conform with the specification of your
Federated Ingress - either way, the end result will be the same).
The status of your Federated Ingress will automatically reflect the
real-time status of the underlying Kubernetes ingresses, for example:
``` shell
$ kubectl --context=federation-cluster describe ingress myingress
Name: myingress
Namespace: default
Address: 130.211.5.194
TLS:
tls-secret terminates
Rules:
Host Path Backends
---- ---- --------
* * echoheaders-https:80 (10.152.1.3:8080,10.152.2.4:8080)
Annotations:
https-target-proxy: k8s-tps-default-myingress--ff1107f83ed600c0
target-proxy: k8s-tp-default-myingress--ff1107f83ed600c0
url-map: k8s-um-default-myingress--ff1107f83ed600c0
backends: {"k8s-be-30301--ff1107f83ed600c0":"Unknown"}
forwarding-rule: k8s-fw-default-myingress--ff1107f83ed600c0
https-forwarding-rule: k8s-fws-default-myingress--ff1107f83ed600c0
Events:
FirstSeen LastSeen Count From SubobjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
3m 3m 1 {loadbalancer-controller } Normal ADD default/myingress
2m 2m 1 {loadbalancer-controller } Normal CREATE ip: 130.211.5.194
```
Note that:
1. the address of your Federated Ingress
corresponds with the address of all of the
underlying Kubernetes ingresses (once these have been allocated - this
may take up to a few minutes).
2. we have not yet provisioned any backend Pods to receive
the network traffic directed to this ingress (i.e. 'Service
Endpoints' behind the service backing the Ingress), so the Federated Ingress does not yet consider these to
be healthy shards and will not direct traffic to any of these clusters.
3. the federation control system will
automatically reconfigure the load balancer controllers in all of the
clusters in your federation to make them consistent, and allow
them to share global load balancers. But this reconfiguration can
only complete successfully if there are no pre-existing Ingresses in
those clusters (this is a safety feature to prevent accidental
breakage of existing ingresses). So to ensure that your federated
ingresses function correctly, either start with new, empty clusters, or make
sure that you delete (and recreate if necessary) all pre-existing
Ingresses in the clusters comprising your federation.
## Adding backend services and pods
To render the underlying ingress shards healthy, we need to add
backend Pods behind the service upon which the Ingress is based. There are several ways to achieve this, but
the easiest is to create a Federated Service and
Federated ReplicaSet. Details of how those
work are covered in the aforementioned user guides - here we'll simply use them to
create appropriately labelled pods and services in the 13 underlying clusters of
our federation:
``` shell
kubectl --context=federation-cluster create -f services/nginx.yaml
```
``` shell
kubectl --context=federation-cluster create -f myreplicaset.yaml
```
Note that in order for your federated ingress to work correctly on
Google Cloud, the node ports of all of the underlying cluster-local
services need to be identical. If you're using a federated service
this is easy to do. Simply pick a node port that is not already
being used in any of your clusters, and add that to the spec of your
federated service. If you do not specify a node port for your
federated service, each cluster will choose its own node port for
its cluster-local shard of the service, and these will probably end
up being different, which is not what you want.
You can verify this by checking in each of the underlying clusters, for example:
``` shell
kubectl --context=gce-asia-east1a get services nginx
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
nginx 10.63.250.98 104.199.136.89 80/TCP 9m
```
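If you need to pin the node port as described above, a sketch of a federated Service manifest with an explicit node port might look like this (the selector and the node port value are illustrative):
```yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  type: NodePort
  selector:
    app: nginx           # assumed pod label
  ports:
  - port: 80
    targetPort: 80
    nodePort: 30301      # pick any port in the node-port range that is free in every cluster
```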
## Hybrid cloud capabilities
Federations of Kubernetes Clusters can include clusters running in
different cloud providers (e.g. Google Cloud, AWS), and on-premises
(e.g. on OpenStack). However, in Kubernetes v1.4, Federated Ingress is only
supported across Google Cloud clusters. In future versions we intend
to support hybrid cloud Ingress-based deployments.
## Discovering a federated ingress
Ingress objects (in both plain Kubernetes clusters, and in federations
of clusters) expose one or more IP addresses (via
the Status.Loadbalancer.Ingress field) that remain static for the lifetime
of the Ingress object (in future, automatically managed DNS names
might also be added). All clients (whether internal to your cluster,
or on the external network or internet) should connect to one of these IP
or DNS addresses. As mentioned above, all client requests are automatically
routed, via the shortest network path, to a healthy pod in the
closest cluster to the origin of the request. So for example, HTTP(S)
requests from internet
users in Europe will be routed directly to the closest cluster in
Europe that has available capacity. If there are no such clusters in
Europe, the request will be routed to the next closest cluster
(typically in the U.S.).
## Handling failures of backend pods and whole clusters
Ingresses are backed by Services, which are typically (but not always)
backed by one or more ReplicaSets. For Federated Ingresses, it is
common practice to use the federated variants of Services and
ReplicaSets for this purpose, as
described above.
In particular, Federated ReplicaSets ensure that the desired number of
pods are kept running in each cluster, even in the event of node
failures. In the event of entire cluster or availability zone
failures, Federated ReplicaSets automatically place additional
replicas in the other available clusters in the federation to accommodate the
traffic which was previously being served by the now unavailable
cluster. While the Federated ReplicaSet ensures that sufficient replicas are
kept running, the Federated Ingress ensures that user traffic is
automatically redirected away from the failed cluster to other
available clusters.
## Known issue
GCE L7 load balancer back-ends and health checks are known to "flap"; this is due
to conflicting firewall rules in the federation's underlying clusters, which might override one another. To work around this problem, you can
install the firewall rules manually to expose the targets of all the
underlying clusters in your federation for each Federated Ingress
object. This way, the health checks can consistently pass and the GCE L7 load balancer
can remain stable. You install the rules using the
[`gcloud`](https://cloud.google.com/sdk/gcloud/) command line tool,
[Google Cloud Console](https://console.cloud.google.com) or the
[Google Compute Engine APIs](https://cloud.google.com/compute/docs/reference/latest/).
You can install these rules using
[`gcloud`](https://cloud.google.com/sdk/gcloud/) as follows:
```shell
gcloud compute firewall-rules create <firewall-rule-name> \
--source-ranges 130.211.0.0/22 --allow [<service-nodeports>] \
--target-tags [<target-tags>] \
--network <network-name>
```
where:
1. `firewall-rule-name` can be any name.
2. `[<service-nodeports>]` is the comma separated list of node ports corresponding to the services that back the Federated Ingress.
3. `[<target-tags>]` is the comma separated list of the target tags assigned to the nodes in a Kubernetes cluster.
4. `<network-name>` is the name of the network where the firewall rule must be installed.
Example:
```shell
gcloud compute firewall-rules create my-federated-ingress-firewall-rule \
--source-ranges 130.211.0.0/22 --allow tcp:30301,tcp:30061,tcp:34564 \
--target-tags my-cluster-1-minion,my-cluster-2-minion \
--network default
```
## Troubleshooting
#### I cannot connect to my cluster federation API
Check that your
1. Client (typically kubectl) is correctly configured (including API endpoints and login credentials), and
2. Cluster Federation API server is running and network-reachable.
See the [federation admin guide](/docs/admin/federation/) to learn
how to bring up a cluster federation correctly (or have your cluster administrator do this for you), and how to correctly configure your client.
#### I can create a federated ingress/service/replicaset successfully against the cluster federation API, but no matching ingresses/services/replicasets are created in my underlying clusters
Check that:
1. Your clusters are correctly registered in the Cluster Federation API (`kubectl describe clusters`)
2. Your clusters are all 'Active'. This means that the cluster
Federation system was able to connect and authenticate against the
clusters' endpoints. If not, consult the event logs of the federation-controller-manager pod to ascertain what the failure might be. (`kubectl --namespace=federation logs $(kubectl get pods --namespace=federation -l module=federation-controller-manager -oname)`)
3. That the login credentials provided to the Cluster Federation API
for the clusters have the correct authorization and quota to create
ingresses/services/replicasets in the relevant namespace in the
clusters. Again you should see associated error messages providing
more detail in the above event log file if this is not the case.
4. Whether any other error is preventing the service creation
operation from succeeding (look for `ingress-controller`,
`service-controller` or `replicaset-controller`
errors in the output of `kubectl logs federation-controller-manager --namespace federation`).
#### I can create a federated ingress successfully, but request load is not correctly distributed across the underlying clusters
Check that:
1. the services underlying your federated ingress in each cluster have
identical node ports. See [above](#creating_a_federated_ingress) for further explanation.
2. the load balancer controllers in each of your clusters are of the
correct type ("GLBC") and have been correctly reconfigured by the
federation control plane to share a global GCE load balancer (this
should happen automatically). If they are of the correct type, and
have been correctly reconfigured, the UID data item in the GLBC
configmap in each cluster will be identical across all clusters.
See
[the GLBC docs](https://github.com/kubernetes/ingress/blob/7dcb4ae17d5def23d3e9c878f3146ac6df61b09d/controllers/gce/README.md)
for further details.
If this is not the case, check the logs of your federation
controller manager to determine why this automated reconfiguration
might be failing.
3. no ingresses have been manually created in any of your clusters before the above
reconfiguration of the load balancer controller completed
successfully. Ingresses created before the reconfiguration of
your GLBC will interfere with the behavior of your federated
ingresses created after the reconfiguration (see
[the GLBC docs](https://github.com/kubernetes/ingress/blob/7dcb4ae17d5def23d3e9c878f3146ac6df61b09d/controllers/gce/README.md)
for further information). To remedy this,
delete any ingresses created before the cluster joined the
federation (and had its GLBC reconfigured), and recreate them if
necessary.
#### This troubleshooting guide did not help me solve my problem
Please use one of our [support channels](http://kubernetes.io/docs/troubleshooting/) to seek assistance.
## For more information
* [Federation proposal](https://github.com/kubernetes/kubernetes/blob/{{page.githubbranch}}/docs/proposals/federation.md) details use cases that motivated this work.

View File

@ -0,0 +1,90 @@
---
title: Federated Namespaces
---
This guide explains how to use namespaces in Federation control plane.
* TOC
{:toc}
## Prerequisites
This guide assumes that you have a running Kubernetes Cluster
Federation installation. If not, then head over to the
[federation admin guide](/docs/admin/federation/) to learn how to
bring up a cluster federation (or have your cluster administrator do
this for you). Other tutorials, for example
[this one](https://github.com/kelseyhightower/kubernetes-cluster-federation)
by Kelsey Hightower, are also available to help you.
You are also expected to have a basic
[working knowledge of Kubernetes](/docs/getting-started-guides/) in
general and [Namespaces](/docs/user-guide/namespaces/) in particular.
## Overview
Namespaces in the federation control plane (referred to as "federated namespaces" in
this guide) are very similar to traditional [Kubernetes
Namespaces](/docs/user-guide/namespaces/) and provide the same functionality.
Creating them in the federation control plane ensures that they are synchronized
across all the clusters in federation.
## Creating a Federated Namespace
The API for Federated Namespaces is 100% compatible with the
API for traditional Kubernetes Namespaces. You can create a namespace by sending
a request to the federation apiserver.
You can do that using kubectl by running:
``` shell
kubectl --context=federation-cluster create -f myns.yaml
```
The '--context=federation-cluster' flag tells kubectl to submit the
request to the Federation apiserver instead of sending it to a Kubernetes
cluster.
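For completeness, `myns.yaml` is just an ordinary Namespace manifest, for example:
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: myns
```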
Once a federated namespace is created, the federation control plane will create
a matching namespace in all underlying Kubernetes clusters.
You can verify this by checking each of the underlying clusters, for example:
``` shell
kubectl --context=gce-asia-east1a get namespaces myns
```
The above assumes that you have a context named 'gce-asia-east1a'
configured in your client for your cluster in that zone. The name and
spec of the underlying namespace will match those of
the Federated Namespace that you created above.
## Updating a Federated Namespace
You can update a federated namespace as you would update a Kubernetes
namespace; however, for a federated namespace, you must send the request to
the federation apiserver instead of sending it to a specific Kubernetes cluster.
The federation control plane ensures that whenever the federated namespace is
updated, it updates the corresponding namespaces in all underlying clusters to
match it.
## Deleting a Federated Namespace
You can delete a federated namespace as you would delete a Kubernetes
namespace; however, for a federated namespace, you must send the request to
the federation apiserver instead of sending it to a specific Kubernetes cluster.
For example, you can do that using kubectl by running:
```shell
kubectl --context=federation-cluster delete ns myns
```
As in Kubernetes, deleting a federated namespace will delete all resources in that
namespace from the federation control plane.
Note that at this point, deleting a federated namespace will not delete the
corresponding namespaces and resources in those namespaces from underlying clusters.
Users are expected to delete them manually.
We intend to fix this in the future.

View File

@ -0,0 +1,105 @@
---
title: Federated ReplicaSets
---
This guide explains how to use replica sets in the Federation control plane.
* TOC
{:toc}
## Prerequisites
This guide assumes that you have a running Kubernetes Cluster
Federation installation. If not, then head over to the
[federation admin guide](/docs/admin/federation/) to learn how to
bring up a cluster federation (or have your cluster administrator do
this for you). Other tutorials, for example
[this one](https://github.com/kelseyhightower/kubernetes-cluster-federation)
by Kelsey Hightower, are also available to help you.
You are also expected to have a basic
[working knowledge of Kubernetes](/docs/getting-started-guides/) in
general and [ReplicaSets](/docs/user-guide/replicasets/) in particular.
## Overview
Replica Sets in the federation control plane (referred to as "federated replica sets" in
this guide) are very similar to the traditional [Kubernetes
ReplicaSets](/docs/user-guide/replicasets/), and provide the same functionality.
Creating them in the federation control plane ensures that the desired number of
replicas exist across the registered clusters.
## Creating a Federated Replica Set
The API for Federated Replica Set is 100% compatible with the
API for traditional Kubernetes Replica Set. You can create a replica set by sending
a request to the federation apiserver.
You can do that using [kubectl](/docs/user-guide/kubectl/) by running:
``` shell
kubectl --context=federation-cluster create -f myrs.yaml
```
The '--context=federation-cluster' flag tells kubectl to submit the
request to the Federation apiserver instead of sending it to a Kubernetes
cluster.
Once a federated replica set is created, the federation control plane will create
a replica set in all underlying Kubernetes clusters.
You can verify this by checking each of the underlying clusters, for example:
``` shell
kubectl --context=gce-asia-east1a get rs myrs
```
The above assumes that you have a context named 'gce-asia-east1a'
configured in your client for your cluster in that zone.
These replica sets in underlying clusters will match the federation replica set
except in the number of replicas. The federation control plane ensures that the
sum of replicas across all underlying clusters matches the desired number of replicas in the
federation replica set.
### Spreading Replicas in Underlying Clusters
By default, replicas are spread equally across all the underlying clusters. For example,
if you have 3 registered clusters and you create a federated replica set with
`spec.replicas = 9`, then each replica set in the 3 clusters will have
`spec.replicas=3`.
To modify the number of replicas in each cluster, you can specify
[FederatedReplicaSetPreference](https://github.com/kubernetes/kubernetes/blob/{{page.githubbranch}}/federation/apis/federation/types.go)
as an annotation with key `federation.kubernetes.io/replica-set-preferences`
on federated replica set.
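A sketch of what such an annotation might look like follows. The field names are based on the FederatedReplicaSetPreference type linked above, but treat the exact shape, cluster names, and weights as illustrative rather than authoritative:
```yaml
apiVersion: extensions/v1beta1
kind: ReplicaSet
metadata:
  name: myrs
  annotations:
    # Illustrative preference: keep at least 2 replicas in gce-us-central1a,
    # and weight gce-asia-east1a twice as heavily when distributing the rest.
    federation.kubernetes.io/replica-set-preferences: |
      {
        "rebalance": true,
        "clusters": {
          "gce-us-central1a": {"minReplicas": 2, "weight": 1},
          "gce-asia-east1a": {"weight": 2}
        }
      }
spec:
  replicas: 9
  template:
    metadata:
      labels:
        app: nginx        # illustrative label
    spec:
      containers:
      - name: nginx
        image: nginx      # illustrative image
```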
## Updating a Federated Replica Set
You can update a federated replica set as you would update a Kubernetes
replica set; however, for a federated replica set, you must send the request to
the federation apiserver instead of sending it to a specific Kubernetes cluster.
The federation control plane ensures that whenever the federated replica set is
updated, it updates the corresponding replica sets in all underlying clusters to
match it.
If your update includes a change in number of replicas, the federation
control plane will change the number of replicas in underlying clusters to
ensure that their sum remains equal to the number of desired replicas in
federated replica set.
## Deleting a Federated Replica Set
You can delete a federated replica set as you would delete a Kubernetes
replica set; however, for a federated replica set, you must send the request to
the federation apiserver instead of sending it to a specific Kubernetes cluster.
For example, you can do that using kubectl by running:
```shell
kubectl --context=federation-cluster delete rs myrs
```
Note that at this point, deleting a federated replica set will not delete the
corresponding replica sets from underlying clusters.
You must delete the underlying Replica Sets manually.
We intend to fix this in the future.

View File

@ -0,0 +1,87 @@
---
title: Federated Secrets
---
This guide explains how to use secrets in Federation control plane.
* TOC
{:toc}
## Prerequisites
This guide assumes that you have a running Kubernetes Cluster
Federation installation. If not, then head over to the
[federation admin guide](/docs/admin/federation/) to learn how to
bring up a cluster federation (or have your cluster administrator do
this for you). Other tutorials, for example
[this one](https://github.com/kelseyhightower/kubernetes-cluster-federation)
by Kelsey Hightower, are also available to help you.
You are also expected to have a basic
[working knowledge of Kubernetes](/docs/getting-started-guides/) in
general and [Secrets](/docs/user-guide/secrets/) in particular.
## Overview
Secrets in the federation control plane (referred to as "federated secrets" in
this guide) are very similar to traditional [Kubernetes
Secrets](/docs/user-guide/secrets/) and provide the same functionality.
Creating them in the federation control plane ensures that they are synchronized
across all the clusters in federation.
## Creating a Federated Secret
The API for Federated Secret is 100% compatible with the
API for traditional Kubernetes Secret. You can create a secret by sending
a request to the federation apiserver.
You can do that using [kubectl](/docs/user-guide/kubectl/) by running:
``` shell
kubectl --context=federation-cluster create -f mysecret.yaml
```
The '--context=federation-cluster' flag tells kubectl to submit the
request to the Federation apiserver instead of sending it to a Kubernetes
cluster.
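`mysecret.yaml` is an ordinary Secret manifest, for example (the key and base64-encoded value are illustrative):
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: mysecret
type: Opaque
data:
  password: cGFzc3dvcmQ=   # base64 encoding of "password"; illustrative only
```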
Once a federated secret is created, the federation control plane will create
a matching secret in all underlying Kubernetes clusters.
You can verify this by checking each of the underlying clusters, for example:
``` shell
kubectl --context=gce-asia-east1a get secret mysecret
```
The above assumes that you have a context named 'gce-asia-east1a'
configured in your client for your cluster in that zone.
These secrets in underlying clusters will match the federated secret.
## Updating a Federated Secret
You can update a federated secret as you would update a Kubernetes
secret; however, for a federated secret, you must send the request to
the federation apiserver instead of sending it to a specific Kubernetes cluster.
The federation control plane ensures that whenever the federated secret is
updated, it updates the corresponding secrets in all underlying clusters to
match it.
## Deleting a Federated Secret
You can delete a federated secret as you would delete a Kubernetes
secret; however, for a federated secret, you must send the request to
the federation apiserver instead of sending it to a specific Kubernetes cluster.
For example, you can do that using kubectl by running:
```shell
kubectl --context=federation-cluster delete secret mysecret
```
Note that at this point, deleting a federated secret will not delete the
corresponding secrets from underlying clusters.
You must delete the underlying secrets manually.
We intend to fix this in the future.

View File

@ -0,0 +1,366 @@
---
assignees:
- derekwaynecarr
- janetkuo
title: Applying Resource Quotas and Limits
---
This example demonstrates a typical setup to control resource usage in a namespace.
It demonstrates using the following resources:
* [Namespace](/docs/admin/namespaces)
* [Resource Quota](/docs/admin/resourcequota/)
* [Limit Range](/docs/admin/limitrange/)
This example assumes you have a functional Kubernetes setup.
## Scenario
The cluster-admin is operating a cluster on behalf of a user population, and wants
to control the amount of resources that can be consumed in a particular namespace to promote
fair sharing of the cluster and control cost.
The cluster-admin has the following goals:
* Limit the amount of compute resource for running pods
* Limit the number of persistent volume claims to control access to storage
* Limit the number of load balancers to control cost
* Prevent the use of node ports to preserve scarce resources
* Provide default compute resource requests to enable better scheduling decisions
## Step 1: Create a namespace
This example will work in a custom namespace to demonstrate the concepts involved.
Let's create a new namespace called quota-example:
```shell
$ kubectl create -f docs/admin/resourcequota/namespace.yaml
namespace "quota-example" created
$ kubectl get namespaces
NAME STATUS AGE
default Active 2m
kube-system Active 2m
quota-example Active 39s
```
## Step 2: Apply an object-count quota to the namespace
The cluster-admin wants to control the following resources:
* persistent volume claims
* load balancers
* node ports
Let's create a simple quota that controls object counts for those resource types in this namespace.
```shell
$ kubectl create -f docs/admin/resourcequota/object-counts.yaml --namespace=quota-example
resourcequota "object-counts" created
```
The quota system will observe that a quota has been created, and will calculate consumption
in the namespace in response. This should happen quickly.
Let's describe the quota to see what is currently being consumed in this namespace:
```shell
$ kubectl describe quota object-counts --namespace=quota-example
Name: object-counts
Namespace: quota-example
Resource Used Hard
-------- ---- ----
persistentvolumeclaims 0 2
services.loadbalancers 0 2
services.nodeports 0 0
```
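For reference, the hard limits shown above correspond to a ResourceQuota along these lines; this is a sketch reconstructed from the output, not a copy of `docs/admin/resourcequota/object-counts.yaml`:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: object-counts
spec:
  hard:
    persistentvolumeclaims: "2"
    services.loadbalancers: "2"
    services.nodeports: "0"
```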
The quota system will now prevent users from creating more than the specified amount for each resource.
## Step 3: Apply a compute-resource quota to the namespace
To limit the amount of compute resource that can be consumed in this namespace,
let's create a quota that tracks compute resources.
```shell
$ kubectl create -f docs/admin/resourcequota/compute-resources.yaml --namespace=quota-example
resourcequota "compute-resources" created
```
Let's describe the quota to see what is currently being consumed in this namespace:
```shell
$ kubectl describe quota compute-resources --namespace=quota-example
Name: compute-resources
Namespace: quota-example
Resource Used Hard
-------- ---- ----
limits.cpu 0 2
limits.memory 0 2Gi
pods 0 4
requests.cpu 0 1
requests.memory 0 1Gi
```
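Again for reference, a quota matching the hard limits shown above would look roughly like this (a sketch, not a copy of `docs/admin/resourcequota/compute-resources.yaml`):
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
spec:
  hard:
    pods: "4"
    requests.cpu: "1"
    requests.memory: 1Gi
    limits.cpu: "2"
    limits.memory: 2Gi
```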
The quota system will now prevent the namespace from having more than 4 non-terminal pods. In
addition, it will enforce that each container in a pod makes a `request` and defines a `limit` for
`cpu` and `memory`.
## Step 4: Applying default resource requests and limits
Pod authors rarely specify resource requests and limits for their pods.
Since we applied a quota to our project, let's see what happens when an end-user creates a pod that has unbounded
cpu and memory by creating an nginx container.
To demonstrate, let's create a deployment that runs nginx:
```shell
$ kubectl run nginx --image=nginx --replicas=1 --namespace=quota-example
deployment "nginx" created
```
Now let's look at the pods that were created.
```shell
$ kubectl get pods --namespace=quota-example
```
What happened? I have no pods! Let's describe the deployment to get a view of what is happening.
```shell
$ kubectl describe deployment nginx --namespace=quota-example
Name: nginx
Namespace: quota-example
CreationTimestamp: Mon, 06 Jun 2016 16:11:37 -0400
Labels: run=nginx
Selector: run=nginx
Replicas: 0 updated | 1 total | 0 available | 1 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 1 max unavailable, 1 max surge
OldReplicaSets: <none>
NewReplicaSet: nginx-3137573019 (0/1 replicas created)
...
```
A deployment created a corresponding replica set and attempted to size it to create a single pod.
Let's look at the replica set to get more detail.
```shell
$ kubectl describe rs nginx-3137573019 --namespace=quota-example
Name: nginx-3137573019
Namespace: quota-example
Image(s): nginx
Selector: pod-template-hash=3137573019,run=nginx
Labels: pod-template-hash=3137573019
run=nginx
Replicas: 0 current / 1 desired
Pods Status: 0 Running / 0 Waiting / 0 Succeeded / 0 Failed
No volumes.
Events:
FirstSeen LastSeen Count From SubobjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
4m 7s 11 {replicaset-controller } Warning FailedCreate Error creating: pods "nginx-3137573019-" is forbidden: Failed quota: compute-resources: must specify limits.cpu,limits.memory,requests.cpu,requests.memory
```
The Kubernetes API server is rejecting the replica set requests to create a pod because our pods
do not specify `requests` or `limits` for `cpu` and `memory`.
So let's set some default values for the amount of `cpu` and `memory` a pod can consume:
```shell
$ kubectl create -f docs/admin/resourcequota/limits.yaml --namespace=quota-example
limitrange "limits" created
$ kubectl describe limits limits --namespace=quota-example
Name: limits
Namespace: quota-example
Type Resource Min Max Default Request Default Limit Max Limit/Request Ratio
---- -------- --- --- --------------- ------------- -----------------------
Container memory - - 256Mi 512Mi -
Container cpu - - 100m 200m -
```
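The defaults shown above correspond to a LimitRange roughly like the following (a sketch reconstructed from the `kubectl describe limits` output, not a copy of `docs/admin/resourcequota/limits.yaml`):
```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: limits
spec:
  limits:
  - type: Container
    defaultRequest:
      cpu: 100m
      memory: 256Mi
    default:
      cpu: 200m
      memory: 512Mi
```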
If the Kubernetes API server observes a request to create a pod in this namespace, and the containers
in that pod do not make any compute resource requests, a default request and default limit will be applied
as part of admission control.
In this example, each pod created will have compute resources equivalent to the following:
```shell
$ kubectl run nginx \
--image=nginx \
--replicas=1 \
--requests=cpu=100m,memory=256Mi \
--limits=cpu=200m,memory=512Mi \
--namespace=quota-example
```
Now that we have applied default compute resources for our namespace, our replica set should be able to create
its pods.
```shell
$ kubectl get pods --namespace=quota-example
NAME READY STATUS RESTARTS AGE
nginx-3137573019-fvrig 1/1 Running 0 6m
```
And if we print out our quota usage in the namespace:
```shell
$ kubectl describe quota --namespace=quota-example
Name: compute-resources
Namespace: quota-example
Resource Used Hard
-------- ---- ----
limits.cpu 200m 2
limits.memory 512Mi 2Gi
pods 1 4
requests.cpu 100m 1
requests.memory 256Mi 1Gi
Name: object-counts
Namespace: quota-example
Resource Used Hard
-------- ---- ----
persistentvolumeclaims 0 2
services.loadbalancers 0 2
services.nodeports 0 0
```
As you can see, the pod that was created is consuming explicit amounts of compute resources, and the usage is being
tracked by Kubernetes properly.
## Step 5: Advanced quota scopes
Let's imagine you did not want to specify default compute resource consumption in your namespace.
Instead, you want to let users run a specific number of `BestEffort` pods in their namespace to take
advantage of slack compute resources, and then require that users make an explicit resource request for
pods that require a higher quality of service.
Let's create a new namespace with two quotas to demonstrate this behavior:
```shell
$ kubectl create namespace quota-scopes
namespace "quota-scopes" created
$ kubectl create -f docs/admin/resourcequota/best-effort.yaml --namespace=quota-scopes
resourcequota "best-effort" created
$ kubectl create -f docs/admin/resourcequota/not-best-effort.yaml --namespace=quota-scopes
resourcequota "not-best-effort" created
$ kubectl describe quota --namespace=quota-scopes
Name: best-effort
Namespace: quota-scopes
Scopes: BestEffort
* Matches all pods that have best effort quality of service.
Resource Used Hard
-------- ---- ----
pods 0 10
Name: not-best-effort
Namespace: quota-scopes
Scopes: NotBestEffort
* Matches all pods that do not have best effort quality of service.
Resource Used Hard
-------- ---- ----
limits.cpu 0 2
limits.memory 0 2Gi
pods 0 4
requests.cpu 0 1
requests.memory 0 1Gi
```
In this scenario, a pod that makes no compute resource requests will be tracked by the `best-effort` quota.
A pod that does make compute resource requests will be tracked by the `not-best-effort` quota.
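The two quota files referenced above would look roughly like the following sketch, reconstructed from the `kubectl describe quota` output (the actual files under `docs/admin/resourcequota/` may be laid out differently):
```yaml
# best-effort.yaml (sketch)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: best-effort
spec:
  hard:
    pods: "10"
  scopes:
  - BestEffort
---
# not-best-effort.yaml (sketch)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: not-best-effort
spec:
  hard:
    pods: "4"
    requests.cpu: "1"
    requests.memory: 1Gi
    limits.cpu: "2"
    limits.memory: 2Gi
  scopes:
  - NotBestEffort
```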
Let's demonstrate this by creating two deployments:
```shell
$ kubectl run best-effort-nginx --image=nginx --replicas=8 --namespace=quota-scopes
deployment "best-effort-nginx" created
$ kubectl run not-best-effort-nginx \
--image=nginx \
--replicas=2 \
--requests=cpu=100m,memory=256Mi \
--limits=cpu=200m,memory=512Mi \
--namespace=quota-scopes
deployment "not-best-effort-nginx" created
```
Even though no default limits were specified, the `best-effort-nginx` deployment will create
all 8 pods. This is because it is tracked by the `best-effort` quota, and the `not-best-effort`
quota will just ignore it. The `not-best-effort` quota will track the `not-best-effort-nginx`
deployment since it creates pods with `Burstable` quality of service.
Let's list the pods in the namespace:
```shell
$ kubectl get pods --namespace=quota-scopes
NAME READY STATUS RESTARTS AGE
best-effort-nginx-3488455095-2qb41 1/1 Running 0 51s
best-effort-nginx-3488455095-3go7n 1/1 Running 0 51s
best-effort-nginx-3488455095-9o2xg 1/1 Running 0 51s
best-effort-nginx-3488455095-eyg40 1/1 Running 0 51s
best-effort-nginx-3488455095-gcs3v 1/1 Running 0 51s
best-effort-nginx-3488455095-rq8p1 1/1 Running 0 51s
best-effort-nginx-3488455095-udhhd 1/1 Running 0 51s
best-effort-nginx-3488455095-zmk12 1/1 Running 0 51s
not-best-effort-nginx-2204666826-7sl61 1/1 Running 0 23s
not-best-effort-nginx-2204666826-ke746 1/1 Running 0 23s
```
As you can see, all 10 pods have been allowed to be created.
Let's describe current quota usage in the namespace:
```shell
$ kubectl describe quota --namespace=quota-scopes
Name: best-effort
Namespace: quota-scopes
Scopes: BestEffort
* Matches all pods that have best effort quality of service.
Resource Used Hard
-------- ---- ----
pods 8 10
Name: not-best-effort
Namespace: quota-scopes
Scopes: NotBestEffort
* Matches all pods that do not have best effort quality of service.
Resource Used Hard
-------- ---- ----
limits.cpu 400m 2
limits.memory 1Gi 2Gi
pods 2 4
requests.cpu 200m 1
requests.memory 512Mi 1Gi
```
As you can see, the `best-effort` quota has tracked the usage for the 8 pods we created in
the `best-effort-nginx` deployment, and the `not-best-effort` quota has tracked the usage for
the 2 pods we created in the `not-best-effort-nginx` deployment.
Scopes provide a mechanism to subdivide the set of resources that are tracked by
any quota document to allow greater flexibility in how operators deploy and track resource
consumption.
In addition to `BestEffort` and `NotBestEffort` scopes, there are scopes to restrict
long-running versus time-bound pods. The `Terminating` scope will match any pod
where `spec.activeDeadlineSeconds` is not nil. The `NotTerminating` scope will match any pod
where `spec.activeDeadlineSeconds` is nil. These scopes allow you to quota pods based on their
anticipated permanence on a node in your cluster.
## Summary
Actions that consume node resources for cpu and memory can be subject to hard quota limits defined by the namespace quota.
Any action that consumes those resources can be tweaked, or can pick up namespace-level defaults to meet your end goal.
Quota can be apportioned based on quality of service and anticipated permanence on a node in your cluster.

View File

@ -0,0 +1,95 @@
---
assignees:
- davidopp
title: Configuring a Pod Disruption Budget
---
This guide is for anyone wishing to specify safety constraints on pods or anyone
wishing to write software (typically automation software) that respects those
constraints.
* TOC
{:toc}
## Rationale
Various cluster management operations may voluntarily evict pods. "Voluntary"
means an eviction can be safely delayed for a reasonable period of time. The
principal examples today are draining a node for maintenance or upgrade
(`kubectl drain`), and cluster autoscaling down. In the future the
[rescheduler](https://github.com/kubernetes/kubernetes/blob/master/docs/proposals/rescheduling.md)
may also perform voluntary evictions. By contrast, evicting pods
because a node has become unreachable or reports `NotReady` is not "voluntary."
For voluntary evictions, it can be useful for applications to be able to limit
the number of pods that are down simultaneously. For example, a quorum-based application would
like to ensure that the number of replicas running is never brought below the
number needed for a quorum, even temporarily. Or a web front end might want to
ensure that the number of replicas serving load never falls below a certain
percentage of the total, even briefly. `PodDisruptionBudget` is an API object
that specifies the minimum number or percentage of replicas of a collection that
must be up at a time. Components that wish to evict a pod subject to disruption
budget use the `/eviction` subresource; unlike a regular pod deletion, this
operation may be rejected by the API server if the eviction would cause a
disruption budget to be violated.
## Specifying a PodDisruptionBudget
A `PodDisruptionBudget` has two components: a label selector `selector` to specify the set of
pods to which it applies, and `minAvailable` which is a description of the number of pods from that
set that must still be available after the eviction, i.e. even in the absence
of the evicted pod. `minAvailable` can be either an absolute number or a percentage.
So for example, 100% means no voluntary evictions from the set are permitted. In
typical usage, a single budget would be used for a collection of pods managed by
a controller—for example, the pods in a single ReplicaSet.
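A minimal sketch of such a budget follows; the name, label selector, and `minAvailable` value are hypothetical:
```yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: zk-pdb                 # hypothetical name
spec:
  minAvailable: 2              # an absolute number; a percentage such as "80%" also works
  selector:
    matchLabels:
      app: zookeeper           # hypothetical label selecting the pods this budget covers
```
With a budget like this in place, the eviction API described below refuses evictions that would leave fewer than the specified number of matching pods available.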
Note that a disruption budget does not truly guarantee that the specified
number/percentage of pods will always be up. For example, a node that hosts a
pod from the collection may fail when the collection is at the minimum size
specified in the budget, thus bringing the number of available pods from the
collection below the specified size. The budget can only protect against
voluntary evictions, not all causes of unavailability.
## Requesting an eviction
If you are writing infrastructure software that wants to produce these voluntary
evictions, you will need to use the eviction API. The eviction subresource of a
pod can be thought of as a kind of policy-controlled DELETE operation on the pod
itself. To attempt an eviction (perhaps more REST-precisely, to attempt to
*create* an eviction), you POST an attempted operation. Here's an example:
```json
{
  "apiVersion": "policy/v1beta1",
  "kind": "Eviction",
  "metadata": {
    "name": "quux",
    "namespace": "default"
  }
}
```
You can attempt an eviction using `curl`:
```bash
$ curl -v -H 'Content-type: application/json' http://127.0.0.1:8080/api/v1/namespaces/default/pods/quux/eviction -d @eviction.json
```
The API can respond in one of three ways.
1. If the eviction is granted, then the pod is deleted just as if you had sent
a `DELETE` request to the pod's URL and you get back `200 OK`.
2. If the current state of affairs wouldn't allow an eviction by the rules set
forth in the budget, you get back `429 Too Many Requests`. This is
typically used for generic rate limiting of *any* requests, but here we mean
that this request isn't allowed *right now* but it may be allowed later.
Currently, callers do not get any `Retry-After` advice, but they may in
future versions.
3. If there is some kind of misconfiguration, like multiple budgets pointing at
the same pod, you will get `500 Internal Server Error`.
For a given eviction request, there are two cases.
1. There is no budget that matches this pod. In this case, the server always
returns `200 OK`.
2. There is at least one budget. In this case, any of the three above responses may
apply.

View File

@ -0,0 +1,214 @@
---
assignees:
- derekwaynecarr
- janetkuo
title: Setting Pod CPU and Memory Limits
---
By default, pods run with unbounded CPU and memory limits. This means that any pod in the
system will be able to consume as much CPU and memory as is available on the node that executes the pod.
Users may want to impose restrictions on the amount of resources a single pod in the system may consume
for a variety of reasons.
For example:
1. Each node in the cluster has 2GB of memory. The cluster operator does not want to accept pods
that require more than 2GB of memory since no node in the cluster can support the requirement. To prevent a
pod from being permanently unscheduled to a node, the operator instead chooses to reject pods that exceed 2GB
of memory as part of admission control.
2. A cluster is shared by two communities in an organization that runs production and development workloads
respectively. Production workloads may consume up to 8GB of memory, but development workloads may consume up
to 512MB of memory. The cluster operator creates a separate namespace for each workload, and applies limits to
each namespace.
3. Users may create a pod which consumes resources just below the capacity of a machine. The leftover space
may be too small to be useful, but big enough for the waste to be costly over the entire cluster. As a result,
the cluster operator may want to require that a pod consume at least 20% of the memory and CPU of the
average node size in order to provide for more uniform scheduling and to limit waste.
This example demonstrates how limits can be applied to a Kubernetes [namespace](/docs/admin/namespaces/walkthrough/) to control
min/max resource limits per pod. In addition, this example demonstrates how you can
apply default resource limits to pods in the absence of an end-user specified value.
See [LimitRange design doc](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/admission_control_limit_range.md) for more information. For a detailed description of the Kubernetes resource model, see [Resources](/docs/user-guide/compute-resources/).
## Step 0: Prerequisites
This example requires a running Kubernetes cluster. See the [Getting Started guides](/docs/getting-started-guides/) for how to get started.
Change to the `<kubernetes>` directory if you're not already there.
## Step 1: Create a namespace
This example will work in a custom namespace to demonstrate the concepts involved.
Let's create a new namespace called limit-example:
```shell
$ kubectl create namespace limit-example
namespace "limit-example" created
```
Note that `kubectl` commands will print the type and name of the resource created or mutated, which can then be used in subsequent commands:
```shell
$ kubectl get namespaces
NAME STATUS AGE
default Active 51s
limit-example Active 45s
```
## Step 2: Apply a limit to the namespace
Let's create a simple limit in our namespace.
```shell
$ kubectl create -f docs/admin/limitrange/limits.yaml --namespace=limit-example
limitrange "mylimits" created
```
Let's describe the limits that we have imposed in our namespace.
```shell
$ kubectl describe limits mylimits --namespace=limit-example
Name: mylimits
Namespace: limit-example
Type Resource Min Max Default Request Default Limit Max Limit/Request Ratio
---- -------- --- --- --------------- ------------- -----------------------
Pod cpu 200m 2 - - -
Pod memory 6Mi 1Gi - - -
Container cpu 100m 2 200m 300m -
Container memory 3Mi 1Gi 100Mi 200Mi -
```
In this scenario, we have said the following:
1. If a max constraint is specified for a resource (2 CPU and 1Gi memory in this case), then a limit
must be specified for that resource across all containers. Failure to specify a limit will result in
a validation error when attempting to create the pod. Note that a default value of limit is set by
*default* in file `limits.yaml` (300m CPU and 200Mi memory).
2. If a min constraint is specified for a resource (100m CPU and 3Mi memory in this case), then a
request must be specified for that resource across all containers. Failure to specify a request will
result in a validation error when attempting to create the pod. Note that a default value of request is
set by *defaultRequest* in file `limits.yaml` (200m CPU and 100Mi memory).
3. For any pod, the sum of all containers' memory requests must be >= 6Mi and the sum of all containers'
memory limits must be <= 1Gi; the sum of all containers' CPU requests must be >= 200m and the sum of all
containers' CPU limits must be <= 2. A sketch of a `limits.yaml` that expresses these constraints follows this list.
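As mentioned in item 3, here is a sketch of a `limits.yaml` that expresses these constraints, reconstructed from the `kubectl describe limits mylimits` output above (the file shipped in `docs/admin/limitrange/` may differ in layout):
```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: mylimits
spec:
  limits:
  - type: Pod
    min:
      cpu: 200m
      memory: 6Mi
    max:
      cpu: "2"
      memory: 1Gi
  - type: Container
    min:
      cpu: 100m
      memory: 3Mi
    max:
      cpu: "2"
      memory: 1Gi
    defaultRequest:
      cpu: 200m
      memory: 100Mi
    default:
      cpu: 300m
      memory: 200Mi
```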
## Step 3: Enforcing limits at point of creation
The limits enumerated in a namespace are only enforced when a pod is created or updated in
the cluster. If you change the limits to a different value range, it does not affect pods that
were previously created in a namespace.
If a resource (CPU or memory) is being restricted by a limit, the user will get an error at time
of creation explaining why.
Let's first spin up a [Deployment](/docs/user-guide/deployments) that creates a single container Pod to demonstrate
how default values are applied to each pod.
```shell
$ kubectl run nginx --image=nginx --replicas=1 --namespace=limit-example
deployment "nginx" created
```
Note that `kubectl run` creates a Deployment named "nginx" on Kubernetes clusters >= v1.2. If you are running older versions, it creates replication controllers instead.
If you want to obtain the old behavior, use `--generator=run/v1` to create replication controllers. See [`kubectl run`](/docs/user-guide/kubectl/kubectl_run/) for more details.
The Deployment manages 1 replica of a single-container Pod. Let's take a look at the Pod it manages. First, find the name of the Pod:
```shell
$ kubectl get pods --namespace=limit-example
NAME READY STATUS RESTARTS AGE
nginx-2040093540-s8vzu 1/1 Running 0 11s
```
Let's print this Pod with yaml output format (using `-o yaml` flag), and then `grep` the `resources` field. Note that your pod name will be different.
```shell
$ kubectl get pods nginx-2040093540-s8vzu --namespace=limit-example -o yaml | grep resources -C 8
resourceVersion: "57"
selfLink: /api/v1/namespaces/limit-example/pods/nginx-2040093540-ivimu
uid: 67b20741-f53b-11e5-b066-64510658e388
spec:
containers:
- image: nginx
imagePullPolicy: Always
name: nginx
resources:
limits:
cpu: 300m
memory: 200Mi
requests:
cpu: 200m
memory: 100Mi
terminationMessagePath: /dev/termination-log
volumeMounts:
```
Note that our nginx container has picked up the namespace default CPU and memory resource *limits* and *requests*.
Let's create a pod that exceeds our allowed limits by giving it a container that requests 3 CPU cores.
```shell
$ kubectl create -f docs/admin/limitrange/invalid-pod.yaml --namespace=limit-example
Error from server: error when creating "docs/admin/limitrange/invalid-pod.yaml": Pod "invalid-pod" is forbidden: [Maximum cpu usage per Pod is 2, but limit is 3., Maximum cpu usage per Container is 2, but limit is 3.]
```
Let's create a pod that falls within the allowed limit boundaries.
```shell
$ kubectl create -f docs/admin/limitrange/valid-pod.yaml --namespace=limit-example
pod "valid-pod" created
```
Now look at the Pod's resources field:
```shell
$ kubectl get pods valid-pod --namespace=limit-example -o yaml | grep -C 6 resources
uid: 3b1bfd7a-f53c-11e5-b066-64510658e388
spec:
containers:
- image: gcr.io/google_containers/serve_hostname
imagePullPolicy: Always
name: kubernetes-serve-hostname
resources:
limits:
cpu: "1"
memory: 512Mi
requests:
cpu: "1"
memory: 512Mi
```
Note that this pod specifies explicit resource *limits* and *requests* so it did not pick up the namespace
default values.
Note: CPU resource *limits* are enforced in the default Kubernetes setup on the physical node
that runs the container, unless the administrator deploys the kubelet with the following flag:
```shell
$ kubelet --help
Usage of kubelet
....
--cpu-cfs-quota[=true]: Enable CPU CFS quota enforcement for containers that specify CPU limits
$ kubelet --cpu-cfs-quota=false ...
```
## Step 4: Cleanup
To remove the resources used by this example, you can just delete the limit-example namespace.
```shell
$ kubectl delete namespace limit-example
namespace "limit-example" deleted
$ kubectl get namespaces
NAME STATUS AGE
default Active 12m
```
## Summary
Cluster operators that want to restrict the amount of resources a single container or pod may consume
are able to define allowable ranges per Kubernetes namespace. In the absence of any explicit assignments,
the Kubernetes system is able to apply default resource *limits* and *requests* if desired in order to
constrain the amount of resource a pod consumes on a node.

View File

@ -0,0 +1,248 @@
---
assignees:
- Random-Liu
- dchen1107
title: Monitoring Node Health
---
* TOC
{:toc}
## Node Problem Detector
*Node problem detector* is a [DaemonSet](/docs/admin/daemons/) that monitors
node health. It collects node problems from various daemons and reports them
to the apiserver as [NodeCondition](/docs/admin/node/#node-condition) and
[Event](/docs/api-reference/v1/definitions/#_v1_event) objects.
It currently detects some known kernel issues, and will detect more classes of
node problems over time.
Currently Kubernetes won't take any action on the node conditions and events
generated by node problem detector. In the future, a remedy system could be
introduced to deal with node problems.
See more information
[here](https://github.com/kubernetes/node-problem-detector).
## Limitations
* The kernel issue detection of node problem detector only supports file-based
kernel logs for now. It does not support log tools like journald.
* The kernel issue detection of node problem detector makes assumptions about the kernel
log format, and currently only works on Ubuntu and Debian. However, it is easy to extend
it to [support other log formats](/docs/admin/node-problem/#support-other-log-format).
## Enable/Disable in GCE cluster
Node problem detector runs as a [cluster addon](cluster-large.md/#addon-resources) and is enabled by default in the
GCE cluster.
You can enable/disable it by setting the environment variable
`KUBE_ENABLE_NODE_PROBLEM_DETECTOR` before `kube-up.sh`.
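For example, to bring up a GCE cluster with node problem detector turned off (a sketch; the `cluster/kube-up.sh` path assumes a checkout of the Kubernetes source tree):
```shell
# Disable the node problem detector addon for this kube-up.sh run
export KUBE_ENABLE_NODE_PROBLEM_DETECTOR=false
cluster/kube-up.sh
```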
## Use in Other Environments
To enable node problem detector in environments other than GCE, you can use
either `kubectl` or an addon pod.
### Kubectl
This is the recommended way to start node problem detector outside of GCE. It
provides more flexible management, such as overwriting the default
configuration to fit your environment or to detect
custom node problems.
* **Step 1:** Create `node-problem-detector.yaml`:
```yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
name: node-problem-detector-v0.1
namespace: kube-system
labels:
k8s-app: node-problem-detector
version: v0.1
kubernetes.io/cluster-service: "true"
spec:
template:
metadata:
labels:
k8s-app: node-problem-detector
version: v0.1
kubernetes.io/cluster-service: "true"
spec:
hostNetwork: true
containers:
- name: node-problem-detector
image: gcr.io/google_containers/node-problem-detector:v0.1
securityContext:
privileged: true
resources:
limits:
cpu: "200m"
memory: "100Mi"
requests:
cpu: "20m"
memory: "20Mi"
volumeMounts:
- name: log
mountPath: /log
readOnly: true
volumes:
- name: log
hostPath:
path: /var/log/
```
***Notice that you should make sure the system log directory is correct for your
OS distro.***
* **Step 2:** Start node problem detector with `kubectl`:
```shell
kubectl create -f node-problem-detector.yaml
```
### Addon Pod
This is for users who have their own cluster bootstrap solution and don't need
to overwrite the default configuration. They can leverage the addon pod to
further automate the deployment.
Just create `node-problem-detector.yaml`, and put it under the addon pods directory
`/etc/kubernetes/addons/node-problem-detector` on the master node.
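For example, on the master node (a sketch; it assumes the `node-problem-detector.yaml` from Step 1 above is in the current directory):
```shell
sudo mkdir -p /etc/kubernetes/addons/node-problem-detector
sudo cp node-problem-detector.yaml /etc/kubernetes/addons/node-problem-detector/
```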
## Overwrite the Configuration
The [default configuration](https://github.com/kubernetes/node-problem-detector/tree/v0.1/config)
is embedded when building the docker image of node problem detector.
However, you can use [ConfigMap](/docs/user-guide/configmap/) to overwrite it
following the steps:
* **Step 1:** Change the config files in `config/`.
* **Step 2:** Create the ConfigMap `node-problem-detector-config` with `kubectl create configmap
node-problem-detector-config --from-file=config/`.
* **Step 3:** Change the `node-problem-detector.yaml` to use the ConfigMap:
```yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
name: node-problem-detector-v0.1
namespace: kube-system
labels:
k8s-app: node-problem-detector
version: v0.1
kubernetes.io/cluster-service: "true"
spec:
template:
metadata:
labels:
k8s-app: node-problem-detector
version: v0.1
kubernetes.io/cluster-service: "true"
spec:
hostNetwork: true
containers:
- name: node-problem-detector
image: gcr.io/google_containers/node-problem-detector:v0.1
securityContext:
privileged: true
resources:
limits:
cpu: "200m"
memory: "100Mi"
requests:
cpu: "20m"
memory: "20Mi"
volumeMounts:
- name: log
mountPath: /log
readOnly: true
- name: config # Overwrite the config/ directory with ConfigMap volume
mountPath: /config
readOnly: true
volumes:
- name: log
hostPath:
path: /var/log/
- name: config # Define ConfigMap volume
configMap:
name: node-problem-detector-config
```
* **Step 4:** Re-create the node problem detector with the new yaml file:
```shell
kubectl delete -f node-problem-detector.yaml # If you have a node-problem-detector running
kubectl create -f node-problem-detector.yaml
```
***Notice that this approach only applies to a node problem detector started with `kubectl`.***
For a node problem detector running as a cluster addon, overwriting the configuration is
not currently supported, because the addon manager does not support ConfigMap.
## Kernel Monitor
*Kernel Monitor* is a problem daemon in node problem detector. It monitors the kernel log
and detects known kernel issues following predefined rules.
The Kernel Monitor matches kernel issues against a set of predefined rules in
[`config/kernel-monitor.json`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/config/kernel-monitor.json).
The rule list is extensible, and you can always extend it by [overwriting the
configuration](/docs/admin/node-problem/#overwrite-the-configuration).
### Add New NodeConditions
To support new node conditions, you can extend the `conditions` field in
`config/kernel-monitor.json` with a new condition definition:
```json
{
"type": "NodeConditionType",
"reason": "CamelCaseDefaultNodeConditionReason",
"message": "arbitrary default node condition message"
}
```
### Detect New Problems
To detect new problems, you can extend the `rules` field in `config/kernel-monitor.json`
with a new rule definition:
```json
{
"type": "temporary/permanent",
"condition": "NodeConditionOfPermanentIssue",
"reason": "CamelCaseShortReason",
"message": "regexp matching the issue in the kernel log"
}
```
### Change Log Path
The kernel log may be located at a different path in different OS distros. The `log`
field in `config/kernel-monitor.json` is the log path inside the container.
You can always configure it to match your OS distro.
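For example, a quick (and non-authoritative) way to see where your distro keeps the kernel log before editing the config:
```shell
# Typically only one of these exists, depending on the distro
ls -l /var/log/kern.log /var/log/messages /var/log/syslog 2>/dev/null
```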
### Support Other Log Format
Kernel monitor uses a [`Translator`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/pkg/kernelmonitor/translator/translator.go)
plugin to translate the kernel log into its internal data structure. It is easy to
implement a new translator for a new log format.
## Caveats
It is recommended to run the node problem detector in your cluster to monitor
the node health. However, you should be aware that this will introduce extra
resource overhead on each node. Usually this is fine, because:
* The kernel log is generated relatively slowly.
* Resource limit is set for node problem detector.
* Even under high load, the resource usage is acceptable.
(see [benchmark result](https://github.com/kubernetes/node-problem-detector/issues/2#issuecomment-220255629))

View File

@ -0,0 +1,6 @@
FROM python
RUN pip install redis
COPY ./worker.py /worker.py
COPY ./rediswq.py /rediswq.py
CMD python worker.py

View File

@ -0,0 +1,213 @@
---
title: Fine Parallel Processing Using a Work Queue
---
* TOC
{:toc}
# Example: Job with Work Queue with Multiple Work Items per Pod
In this example, we will run a Kubernetes Job with multiple parallel
worker processes. You may want to be familiar with the basic,
non-parallel, use of [Job](/docs/concepts/jobs/run-to-completion-finite-workloads/) first.
In this example, as each pod is created, it picks up one unit of work
from a task queue, completes it, deletes it from the queue, and exits.
Here is an overview of the steps in this example:
1. **Start a storage service to hold the work queue.** In this example, we use Redis to store
our work items. In the previous example, we used RabbitMQ. In this example, we use Redis and
a custom work-queue client library because AMQP does not provide a good way for clients to
detect when a finite-length work queue is empty. In practice you would set up a store such
as Redis once and reuse it for the work queues of many jobs, and other things.
1. **Create a queue, and fill it with messages.** Each message represents one task to be done. In
this example, a message is just an integer that we will do a lengthy computation on.
1. **Start a Job that works on tasks from the queue**. The Job starts several pods. Each pod takes
one task from the message queue, processes it, and repeats until the end of the queue is reached.
## Starting Redis
For this example, for simplicity, we will start a single instance of Redis.
See the [Redis Example](https://github.com/kubernetes/kubernetes/tree/master/examples/guestbook) for an example
of deploying Redis scalably and redundantly.
Start a temporary Pod running Redis and a service so we can find it.
```shell
$ kubectl create -f docs/tasks/job/fine-parallel-processing-work-queue/redis-pod.yaml
pod "redis-master" created
$ kubectl create -f docs/tasks/job/fine-parallel-processing-work-queue/redis-service.yaml
service "redis" created
```
If you're not working from the source tree, you could also download [`redis-pod.yaml`](redis-pod.yaml?raw=true) and [`redis-service.yaml`](redis-service.yaml?raw=true) directly.
## Filling the Queue with tasks
Now let's fill the queue with some "tasks". In our example, our tasks are just strings to be
printed.
Start a temporary interactive pod for running the Redis CLI
```shell
$ kubectl run -i --tty temp --image redis --command "/bin/sh"
Waiting for pod default/redis2-c7h78 to be running, status is Pending, pod ready: false
Hit enter for command prompt
```
Now hit enter, start the redis CLI, and create a list with some work items in it.
```
# redis-cli -h redis
redis:6379> rpush job2 "apple"
(integer) 1
redis:6379> rpush job2 "banana"
(integer) 2
redis:6379> rpush job2 "cherry"
(integer) 3
redis:6379> rpush job2 "date"
(integer) 4
redis:6379> rpush job2 "fig"
(integer) 5
redis:6379> rpush job2 "grape"
(integer) 6
redis:6379> rpush job2 "lemon"
(integer) 7
redis:6379> rpush job2 "melon"
(integer) 8
redis:6379> rpush job2 "orange"
(integer) 9
redis:6379> lrange job2 0 -1
1) "apple"
2) "banana"
3) "cherry"
4) "date"
5) "fig"
6) "grape"
7) "lemon"
8) "melon"
9) "orange"
```
So, the list with key `job2` will be our work queue.
Note: if you do not have Kube DNS set up correctly, you may need to change
the first step of the above block to `redis-cli -h $REDIS_SERVICE_HOST`.
## Create an Image
Now we are ready to create an image that we will run.
We will use a python worker program with a redis client to read
the messages from the message queue.
A simple Redis work queue client library is provided,
called rediswq.py ([Download](rediswq.py?raw=true)).
The "worker" program in each Pod of the Job uses the work queue
client library to get work. Here it is:
{% include code.html language="python" file="worker.py" ghlink="/docs/tasks/job/fine-parallel-processing-work-queue/worker.py" %}
If you are working from the source tree,
change directory to the `docs/tasks/job/fine-parallel-processing-work-queue/` directory.
Otherwise, download [`worker.py`](worker.py?raw=true), [`rediswq.py`](rediswq.py?raw=true), and [`Dockerfile`](Dockerfile?raw=true)
using the above links. Then build the image:
```shell
docker build -t job-wq-2 .
```
### Push the image
For the [Docker Hub](https://hub.docker.com/), tag your app image with
your username and push to the Hub with the below commands. Replace
`<username>` with your Hub username.
```shell
docker tag job-wq-2 <username>/job-wq-2
docker push <username>/job-wq-2
```
You need to push to a public repository or [configure your cluster to be able to access
your private repository](/docs/user-guide/images).
If you are using [Google Container
Registry](https://cloud.google.com/tools/container-registry/), tag
your app image with your project ID, and push to GCR. Replace
`<project>` with your project ID.
```shell
docker tag job-wq-2 gcr.io/<project>/job-wq-2
gcloud docker push gcr.io/<project>/job-wq-2
```
## Defining a Job
Here is the job definition:
{% include code.html language="yaml" file="job.yaml" ghlink="/docs/tasks/job/fine-parallel-processing-work-queue/job.yaml" %}
Be sure to edit the job template to
change `gcr.io/myproject` to your own path.
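One way to make that edit is with `sed` (a sketch; `<project>` stands for your own registry project or Docker Hub username, and it assumes the job template has been saved locally as `./job.yaml`):
```shell
# Rewrite the image path in place (keeps a job.yaml.bak backup)
sed -i.bak 's|gcr.io/myproject/job-wq-2|gcr.io/<project>/job-wq-2|' ./job.yaml
```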
In this example, each pod works on several items from the queue and then exits when there are no more items.
Since the workers themselves detect when the work queue is empty, and the Job controller does not
know about the work queue, it relies on the workers to signal when they are done working.
The workers signal that the queue is empty by exiting with success. As soon as any worker
exits with success, the controller knows the work is done, and the Pods will exit soon.
So, we set the completion count of the Job to 1. The Job controller will wait for the other pods to complete
too.
## Running the Job
So, now run the Job:
```shell
kubectl create -f ./job.yaml
```
Now wait a bit, then check on the job.
```shell
$ kubectl describe jobs/job-wq-2
Name: job-wq-2
Namespace: default
Image(s): gcr.io/exampleproject/job-wq-2
Selector: app in (job-wq-2)
Parallelism: 2
Completions: Unset
Start Time: Mon, 11 Jan 2016 17:07:59 -0800
Labels: app=job-wq-2
Pods Statuses: 1 Running / 0 Succeeded / 0 Failed
No volumes.
Events:
FirstSeen LastSeen Count From SubobjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
33s 33s 1 {job-controller } Normal SuccessfulCreate Created pod: job-wq-2-lglf8
$ kubectl logs pods/job-wq-2-7r7b2
Worker with sessionID: bbd72d0a-9e5c-4dd6-abf6-416cc267991f
Initial queue state: empty=False
Working on banana
Working on date
Working on lemon
```
As you can see, one of our pods worked on several work units.
## Alternatives
If running a queue service or modifying your containers to use a work queue is inconvenient, you may
want to consider one of the other [job patterns](/docs/concepts/jobs/run-to-completion-finite-workloads/#job-patterns).
If you have a continuous stream of background processing work to run, then
consider running your background workers with a `replicationController` instead,
and consider using a background processing library such as
https://github.com/resque/resque.

View File

@ -0,0 +1,14 @@
apiVersion: batch/v1
kind: Job
metadata:
name: job-wq-2
spec:
parallelism: 2
template:
metadata:
name: job-wq-2
spec:
containers:
- name: c
image: gcr.io/myproject/job-wq-2
restartPolicy: OnFailure

View File

@ -0,0 +1,15 @@
apiVersion: v1
kind: Pod
metadata:
name: redis-master
labels:
app: redis
spec:
containers:
- name: master
image: redis
env:
- name: MASTER
value: "true"
ports:
- containerPort: 6379

View File

@ -0,0 +1,10 @@
apiVersion: v1
kind: Service
metadata:
name: redis
spec:
ports:
- port: 6379
targetPort: 6379
selector:
app: redis

View File

@ -0,0 +1,130 @@
#!/usr/bin/env python
# Based on http://peter-hoffmann.com/2012/python-simple-queue-redis-queue.html
# and the suggestion in the redis documentation for RPOPLPUSH, at
# http://redis.io/commands/rpoplpush, which suggests how to implement a work-queue.
import redis
import uuid
import hashlib
class RedisWQ(object):
"""Simple Finite Work Queue with Redis Backend
This work queue is finite: as long as no more work is added
after workers start, the workers can detect when the queue
is completely empty.
The items in the work queue are assumed to have unique values.
This object is not intended to be used by multiple threads
concurrently.
"""
def __init__(self, name, **redis_kwargs):
"""The default connection parameters are: host='localhost', port=6379, db=0
The work queue is identified by "name". The library may create other
keys with "name" as a prefix.
"""
self._db = redis.StrictRedis(**redis_kwargs)
# The session ID will uniquely identify this "worker".
self._session = str(uuid.uuid4())
# Work queue is implemented as two queues: main, and processing.
# Work is initially in main, and moved to processing when a client picks it up.
self._main_q_key = name
self._processing_q_key = name + ":processing"
self._lease_key_prefix = name + ":leased_by_session:"
def sessionID(self):
"""Return the ID for this session."""
return self._session
def _main_qsize(self):
"""Return the size of the main queue."""
return self._db.llen(self._main_q_key)
def _processing_qsize(self):
"""Return the size of the main queue."""
return self._db.llen(self._processing_q_key)
def empty(self):
"""Return True if the queue is empty, including work being done, False otherwise.
False does not necessarily mean that there is work available to work on right now.
"""
return self._main_qsize() == 0 and self._processing_qsize() == 0
# TODO: implement this
# def check_expired_leases(self):
# """Return to the work queueReturn True if the queue is empty, False otherwise."""
# # Processing list should not be _too_ long since it is approximately as long
# # as the number of active and recently active workers.
# processing = self._db.lrange(self._processing_q_key, 0, -1)
# for item in processing:
# # If the lease key is not present for an item (it expired or was
# # never created because the client crashed before creating it)
# # then move the item back to the main queue so others can work on it.
# if not self._lease_exists(item):
# TODO: transactionally move the key from processing queue to
# to main queue, while detecting if a new lease is created
# or if either queue is modified.
def _itemkey(self, item):
"""Returns a string that uniquely identifies an item (bytes)."""
return hashlib.sha224(item).hexdigest()
def _lease_exists(self, item):
"""True if a lease on 'item' exists."""
return self._db.exists(self._lease_key_prefix + self._itemkey(item))
def lease(self, lease_secs=60, block=True, timeout=None):
"""Begin working on an item the work queue.
Lease the item for lease_secs. After that time, other
workers may consider this client to have crashed or stalled
and pick up the item instead.
If optional args block is true and timeout is None (the default), block
if necessary until an item is available."""
if block:
item = self._db.brpoplpush(self._main_q_key, self._processing_q_key, timeout=timeout)
else:
item = self._db.rpoplpush(self._main_q_key, self._processing_q_key)
if item:
# Record that we (this session id) are working on a key. Expire that
# note after the lease timeout.
# Note: if we crash at this line of the program, then GC will see no lease
# for this item and later return it to the main queue.
itemkey = self._itemkey(item)
self._db.setex(self._lease_key_prefix + itemkey, lease_secs, self._session)
return item
def complete(self, value):
"""Complete working on the item with 'value'.
If the lease expired, the item may not have completed, and some
other worker may have picked it up. There is no indication
of what happened.
"""
self._db.lrem(self._processing_q_key, 0, value)
# If we crash here, then the GC code will try to move the value, but it will
# not be here, which is fine. So this does not need to be a transaction.
itemkey = self._itemkey(value)
self._db.delete(self._lease_key_prefix + itemkey, self._session)
# TODO: add functions to clean up all keys associated with "name" when
# processing is complete.
# TODO: add a function to add an item to the queue. Atomically
# check if the queue is empty and if so fail to add the item
# since other workers might think work is done and be in the process
# of exiting.
# TODO(etune): move to my own github for hosting, e.g. github.com/erictune/rediswq-py and
# make it so it can be pip installed by anyone (see
# http://stackoverflow.com/questions/8247605/configuring-so-that-pip-install-can-work-from-github)
# TODO(etune): finish code to GC expired leases, and call periodically
# e.g. each time lease times out.

View File

@ -0,0 +1,23 @@
#!/usr/bin/env python
import time
import rediswq
host="redis"
# Uncomment next two lines if you do not have Kube-DNS working.
# import os
# host = os.getenv("REDIS_SERVICE_HOST")
q = rediswq.RedisWQ(name="job2", host=host)
print("Worker with sessionID: " + q.sessionID())
print("Initial queue state: empty=" + str(q.empty()))
while not q.empty():
item = q.lease(lease_secs=10, block=True, timeout=2)
if item is not None:
itemstr = item.decode("utf-8")
print("Working on " + itemstr)
time.sleep(10) # Put your actual work here instead of sleep.
q.complete(item)
else:
print("Waiting for work")
print("Queue empty, exiting")

18
docs/tasks/job/job.yaml Normal file
View File

@ -0,0 +1,18 @@
apiVersion: batch/v1
kind: Job
metadata:
name: process-item-$ITEM
labels:
jobgroup: jobexample
spec:
template:
metadata:
name: jobexample
labels:
jobgroup: jobexample
spec:
containers:
- name: c
image: busybox
command: ["sh", "-c", "echo Processing item $ITEM && sleep 5"]
restartPolicy: Never

View File

@ -0,0 +1,195 @@
---
title: Parallel Processing using Expansions
---
* TOC
{:toc}
# Example: Multiple Job Objects from Template Expansion
In this example, we will run multiple Kubernetes Jobs created from
a common template. You may want to be familiar with the basic,
non-parallel, use of [Jobs](/docs/concepts/jobs/run-to-completion-finite-workloads/) first.
## Basic Template Expansion
First, download the following template of a job to a file called `job.yaml.txt`
{% include code.html language="yaml" file="job.yaml" ghlink="/docs/tasks/job/parallel-processing-expansion/job.yaml" %}
Unlike a *pod template*, our *job template* is not a Kubernetes API type. It is just
a yaml representation of a Job object that has some placeholders that need to be filled
in before it can be used. The `$ITEM` syntax is not meaningful to Kubernetes.
In this example, the only processing the container does is to `echo` a string and sleep for a bit.
In a real use case, the processing would be some substantial computation, such as rendering a frame
of a movie, or processing a range of rows in a database. The "$ITEM" parameter would specify for
example, the frame number or the row range.
This Job and its Pod template have a label: `jobgroup=jobexample`. There is nothing special
to the system about this label. This label
makes it convenient to operate on all the jobs in this group at once.
We also put the same label on the pod template so that we can check on all Pods of these Jobs
with a single command.
After the job is created, the system will add more labels that distinguish one Job's pods
from another Job's pods.
Note that the label key `jobgroup` is not special to Kubernetes. You can pick your own label scheme.
Next, expand the template into multiple files, one for each item to be processed.
```shell
# Expand files into a temporary directory
mkdir ./jobs
for i in apple banana cherry
do
cat job.yaml.txt | sed "s/\$ITEM/$i/" > ./jobs/job-$i.yaml
done
```
Check if it worked:
```shell
$ ls jobs/
job-apple.yaml
job-banana.yaml
job-cherry.yaml
```
Here, we used `sed` to replace the string `$ITEM` with the loop variable.
You could use any type of template language (jinja2, erb) or write a program
to generate the Job objects.
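As another example, `envsubst` (from GNU gettext) can do the same substitution without `sed`; this is only a sketch and assumes `envsubst` is installed:
```shell
mkdir -p ./jobs
for i in apple banana cherry
do
  # envsubst replaces $ITEM in the template with the value exported for this command
  ITEM=$i envsubst '$ITEM' < job.yaml.txt > ./jobs/job-$i.yaml
done
```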
Next, create all the jobs with one kubectl command:
```shell
$ kubectl create -f ./jobs
job "process-item-apple" created
job "process-item-banana" created
job "process-item-cherry" created
```
Now, check on the jobs:
```shell
$ kubectl get jobs -l jobgroup=jobexample
JOB CONTAINER(S) IMAGE(S) SELECTOR SUCCESSFUL
process-item-apple c busybox app in (jobexample),item in (apple) 1
process-item-banana c busybox app in (jobexample),item in (banana) 1
process-item-cherry c busybox app in (jobexample),item in (cherry) 1
```
Here we use the `-l` option to select all jobs that are part of this
group of jobs. (There might be other unrelated jobs in the system that we
do not care to see.)
We can check on the pods as well using the same label selector:
```shell
$ kubectl get pods -l jobgroup=jobexample --show-all
NAME READY STATUS RESTARTS AGE
process-item-apple-kixwv 0/1 Completed 0 4m
process-item-banana-wrsf7 0/1 Completed 0 4m
process-item-cherry-dnfu9 0/1 Completed 0 4m
```
There is not a single command to check on the output of all jobs at once,
but looping over all the pods is pretty easy:
```shell
$ for p in $(kubectl get pods -l jobgroup=jobexample -o name)
do
kubectl logs $p
done
Processing item apple
Processing item banana
Processing item cherry
```
## Multiple Template Parameters
In the first example, each instance of the template had one parameter, and that parameter was also
used as a label. However label keys are limited in [what characters they can
contain](/docs/user-guide/labels/#syntax-and-character-set).
This slightly more complex example uses the jinja2 template language to generate our objects.
We will use a one-line python script to convert the template to a file.
First, copy and paste the following template of a Job object, into a file called `job.yaml.jinja2`:
```liquid{% raw %}
{%- set params = [{ "name": "apple", "url": "http://www.orangepippin.com/apples", },
{ "name": "banana", "url": "https://en.wikipedia.org/wiki/Banana", },
{ "name": "raspberry", "url": "https://www.raspberrypi.org/" }]
%}
{%- for p in params %}
{%- set name = p["name"] %}
{%- set url = p["url"] %}
apiVersion: batch/v1
kind: Job
metadata:
name: jobexample-{{ name }}
labels:
jobgroup: jobexample
spec:
template:
name: jobexample
labels:
jobgroup: jobexample
spec:
containers:
- name: c
image: busybox
command: ["sh", "-c", "echo Processing URL {{ url }} && sleep 5"]
restartPolicy: Never
---
{%- endfor %}
{% endraw %}
```
The above template defines parameters for each job object using a list of
python dicts (lines 1-4). A for loop then emits one job yaml object
for each set of parameters (remaining lines).
We take advantage of the fact that multiple yaml documents can be concatenated
with the `---` separator (second to last line), so we can pipe the output directly to kubectl to
create the objects.
You will need the jinja2 package if you do not already have it: `pip install --user jinja2`.
Now, use this one-line python program to expand the template:
```shell
alias render_template='python -c "from jinja2 import Template; import sys; print(Template(sys.stdin.read()).render());"'
```
The output can be saved to a file, like this:
```shell
cat job.yaml.jinja2 | render_template > jobs.yaml
```
or sent directly to kubectl, like this:
```shell
cat job.yaml.jinja2 | render_template | kubectl create -f -
```
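Either way, you can then check on the generated Jobs using the shared `jobgroup` label, just as in the first example:
```shell
kubectl get jobs -l jobgroup=jobexample
```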
## Alternatives
If you have a large number of job objects, you may find that:
- Even using labels, managing so many Job objects is cumbersome.
- You exceed resource quota when creating all the Jobs at once,
and do not want to wait to create them incrementally.
- You need a way to easily scale the number of pods running
concurrently. One reason would be to avoid using too many
compute resources. Another would be to limit the number of
concurrent requests to a shared resource, such as a database,
used by all the pods in the job.
- Very large numbers of jobs created at once overload the
Kubernetes apiserver, controller, or scheduler.
In this case, you can consider one of the
other [job patterns](/docs/concepts/jobs/run-to-completion-finite-workloads/#job-patterns).

View File

@ -0,0 +1,10 @@
# Specify BROKER_URL and QUEUE when running
FROM ubuntu:14.04
RUN apt-get update && \
apt-get install -y curl ca-certificates amqp-tools python \
--no-install-recommends \
&& rm -rf /var/lib/apt/lists/*
COPY ./worker.py /worker.py
CMD /usr/bin/amqp-consume --url=$BROKER_URL -q $QUEUE -c 1 /worker.py

View File

@ -0,0 +1,284 @@
---
title: Coarse Parallel Processing Using a Work Queue
---
* TOC
{:toc}
# Example: Job with Work Queue with Pod Per Work Item
In this example, we will run a Kubernetes Job with multiple parallel
worker processes. You may want to be familiar with the basic,
non-parallel, use of [Job](/docs/concepts/jobs/run-to-completion-finite-workloads/) first.
In this example, as each pod is created, it picks up one unit of work
from a task queue, completes it, deletes it from the queue, and exits.
Here is an overview of the steps in this example:
1. **Start a message queue service.** In this example, we use RabbitMQ, but you could use another
one. In practice you would set up a message queue service once and reuse it for many jobs.
1. **Create a queue, and fill it with messages.** Each message represents one task to be done. In
this example, a message is just an integer that we will do a lengthy computation on.
1. **Start a Job that works on tasks from the queue**. The Job starts several pods. Each pod takes
one task from the message queue, processes it, and repeats until the end of the queue is reached.
## Starting a message queue service
This example uses RabbitMQ, but it should be easy to adapt to another AMQP-type message service.
In practice you could set up a message queue service once in a
cluster and reuse it for many jobs, as well as for long-running services.
Start RabbitMQ as follows:
```shell
$ kubectl create -f examples/celery-rabbitmq/rabbitmq-service.yaml
service "rabbitmq-service" created
$ kubectl create -f examples/celery-rabbitmq/rabbitmq-controller.yaml
replicationController "rabbitmq-controller" created
```
We will only use the rabbitmq part from the [celery-rabbitmq example](https://github.com/kubernetes/kubernetes/tree/release-1.3/examples/celery-rabbitmq).
## Testing the message queue service
Now, we can experiment with accessing the message queue. We will
create a temporary interactive pod, install some tools on it,
and experiment with queues.
First create a temporary interactive Pod.
```shell
# Create a temporary interactive container
$ kubectl run -i --tty temp --image ubuntu:14.04
Waiting for pod default/temp-loe07 to be running, status is Pending, pod ready: false
... [ previous line repeats several times .. hit return when it stops ] ...
```
Note that your pod name and command prompt will be different.
Next install the `amqp-tools` so we can work with message queues.
```shell
# Install some tools
root@temp-loe07:/# apt-get update
.... [ lots of output ] ....
root@temp-loe07:/# apt-get install -y curl ca-certificates amqp-tools python dnsutils
.... [ lots of output ] ....
```
Later, we will make a docker image that includes these packages.
Next, we will check that we can discover the rabbitmq service:
```
# Note the rabbitmq-service has a DNS name, provided by Kubernetes:
root@temp-loe07:/# nslookup rabbitmq-service
Server: 10.0.0.10
Address: 10.0.0.10#53
Name: rabbitmq-service.default.svc.cluster.local
Address: 10.0.147.152
# Your address will vary.
```
If Kube-DNS is not setup correctly, the previous step may not work for you.
You can also find the service IP in an env var:
```
# env | grep RABBIT | grep HOST
RABBITMQ_SERVICE_SERVICE_HOST=10.0.147.152
# Your address will vary.
```
Next we will verify we can create a queue, and publish and consume messages.
```shell
# In the next line, rabbitmq-service is the hostname where the rabbitmq-service
# can be reached. 5672 is the standard port for rabbitmq.
root@temp-loe07:/# export BROKER_URL=amqp://guest:guest@rabbitmq-service:5672
# If you could not resolve "rabbitmq-service" in the previous step,
# then use this command instead:
# root@temp-loe07:/# BROKER_URL=amqp://guest:guest@$RABBITMQ_SERVICE_SERVICE_HOST:5672
# Now create a queue:
root@temp-loe07:/# /usr/bin/amqp-declare-queue --url=$BROKER_URL -q foo -d
foo
# Publish one message to it:
root@temp-loe07:/# /usr/bin/amqp-publish --url=$BROKER_URL -r foo -p -b Hello
# And get it back.
root@temp-loe07:/# /usr/bin/amqp-consume --url=$BROKER_URL -q foo -c 1 cat && echo
Hello
root@temp-loe07:/#
```
In the last command, the `amqp-consume` tool takes one message (`-c 1`)
from the queue, and passes that message to the standard input of an arbitrary command. In this case, the program `cat` is just printing
out what it gets on the standard input, and the echo is just to add a carriage
return so the example is readable.
## Filling the Queue with tasks
Now let's fill the queue with some "tasks". In our example, our tasks are just strings to be
printed.
In practice, the content of the messages might be:
- names of files that need to be processed
- extra flags to the program
- ranges of keys in a database table
- configuration parameters to a simulation
- frame numbers of a scene to be rendered
In practice, if there is a large amount of data that is needed in read-only mode by all pods
of the Job, you will typically put it in a shared file system like NFS and mount
it read-only on all the pods, or have the program in the pod natively read data from
a cluster file system like HDFS.
For our example, we will create the queue and fill it using the amqp command line tools.
In practice, you might write a program to fill the queue using an amqp client library.
```shell
$ /usr/bin/amqp-declare-queue --url=$BROKER_URL -q job1 -d
job1
$ for f in apple banana cherry date fig grape lemon melon
do
/usr/bin/amqp-publish --url=$BROKER_URL -r job1 -p -b $f
done
```
So, we filled the queue with 8 messages.
## Create an Image
Now we are ready to create an image that we will run as a job.
We will use the `amqp-consume` utility to read the message
from the queue and run our actual program. Here is a very simple
example program:
{% include code.html language="python" file="worker.py" ghlink="/docs/tasks/job/work-queue-1/worker.py" %}
Now, build an image. If you are working in the source
tree, then change directory to `examples/job/work-queue-1`.
Otherwise, make a temporary directory, change to it,
download the [Dockerfile](Dockerfile?raw=true),
and [worker.py](worker.py?raw=true). In either case,
build the image with this command:
```shell
$ docker build -t job-wq-1 .
```
For the [Docker Hub](https://hub.docker.com/), tag your app image with
your username and push to the Hub with the below commands. Replace
`<username>` with your Hub username.
```shell
docker tag job-wq-1 <username>/job-wq-1
docker push <username>/job-wq-1
```
If you are using [Google Container
Registry](https://cloud.google.com/tools/container-registry/), tag
your app image with your project ID, and push to GCR. Replace
`<project>` with your project ID.
```shell
docker tag job-wq-1 gcr.io/<project>/job-wq-1
gcloud docker push gcr.io/<project>/job-wq-1
```
## Defining a Job
Here is a job definition. You'll need to make a copy of the Job and edit the
image to match the name you used, and call it `./job.yaml`.
{% include code.html language="yaml" file="job.yaml" ghlink="/docs/tasks/job/work-queue-1/job.yaml" %}
In this example, each pod works on one item from the queue and then exits.
So, the completion count of the Job corresponds to the number of work items
done. That is why we set `.spec.completions: 8` for the example, since we put 8 items in the queue.
## Running the Job
So, now run the Job:
```shell
kubectl create -f ./job.yaml
```
Now wait a bit, then check on the job.
```shell
$ kubectl describe jobs/job-wq-1
Name: job-wq-1
Namespace: default
Image(s): gcr.io/causal-jigsaw-637/job-wq-1
Selector: app in (job-wq-1)
Parallelism: 2
Completions: 8
Labels: app=job-wq-1
Pods Statuses: 0 Running / 8 Succeeded / 0 Failed
No volumes.
Events:
FirstSeen LastSeen Count From SubobjectPath Reason Message
───────── ──────── ───── ──── ───────────── ────── ───────
27s 27s 1 {job } SuccessfulCreate Created pod: job-wq-1-hcobb
27s 27s 1 {job } SuccessfulCreate Created pod: job-wq-1-weytj
27s 27s 1 {job } SuccessfulCreate Created pod: job-wq-1-qaam5
27s 27s 1 {job } SuccessfulCreate Created pod: job-wq-1-b67sr
26s 26s 1 {job } SuccessfulCreate Created pod: job-wq-1-xe5hj
15s 15s 1 {job } SuccessfulCreate Created pod: job-wq-1-w2zqe
14s 14s 1 {job } SuccessfulCreate Created pod: job-wq-1-d6ppa
14s 14s 1 {job } SuccessfulCreate Created pod: job-wq-1-p17e0
```
All our pods succeeded. Yay.
## Alternatives
This approach has the advantage that you
do not need to modify your "worker" program to be aware that there is a work queue.
It does require that you run a message queue service.
If running a queue service is inconvenient, you may
want to consider one of the other [job patterns](/docs/concepts/jobs/run-to-completion-finite-workloads/#job-patterns).
This approach creates a pod for every work item. If your work items only take a few seconds,
though, creating a Pod for every work item may add a lot of overhead. Consider another
[example](/docs/tasks/job/fine-parallel-processing-work-queue/), that executes multiple work items per Pod.
In this example, we used the `amqp-consume` utility to read the message
from the queue and run our actual program. This has the advantage that you
do not need to modify your program to be aware of the queue.
A [different example](/docs/tasks/job/fine-parallel-processing-work-queue/) shows how to
communicate with the work queue using a client library.
## Caveats
If the number of completions is set to less than the number of items in the queue, then
not all items will be processed.
If the number of completions is set to more than the number of items in the queue,
then the Job will not appear to be completed, even though all items in the queue
have been processed. It will start additional pods which will block waiting
for a message.
There is an unlikely race with this pattern. If the container is killed in between the time
that the message is acknowledged by the amqp-consume command and the time that the container
exits with success, or if the node crashes before the kubelet is able to post the success of the pod
back to the api-server, then the Job will not appear to be complete, even though all items
in the queue have been processed.

View File

@ -0,0 +1,20 @@
apiVersion: batch/v1
kind: Job
metadata:
name: job-wq-1
spec:
completions: 8
parallelism: 2
template:
metadata:
name: job-wq-1
spec:
containers:
- name: c
image: gcr.io/<project>/job-wq-1
env:
- name: BROKER_URL
value: amqp://guest:guest@rabbitmq-service:5672
- name: QUEUE
value: job1
restartPolicy: OnFailure

View File

@ -0,0 +1,7 @@
#!/usr/bin/env python
# Just prints the message from standard input and sleeps for 10 seconds.
import sys
import time
print("Processing " + sys.stdin.lines())
time.sleep(10)

View File

@ -47,7 +47,7 @@ kubectl scale statefulsets <stateful-set-name> --replicas=<new-replicas>
### Alternative: `kubectl apply` / `kubectl edit` / `kubectl patch`
Alternatively, you can do [in-place updates](/docs/user-guide/managing-deployments/#in-place-updates-of-resources) on your StatefulSets.
Alternatively, you can do [in-place updates](/docs/concepts/cluster-administration/manage-deployment/#in-place-updates-of-resources) on your StatefulSets.
If your StatefulSet was initially created with `kubectl apply` or `kubectl create --save-config`,
update `.spec.replicas` of the StatefulSet manifests, and then do a `kubectl apply`:

View File

@ -0,0 +1,256 @@
---
assignees:
- janetkuo
title: Rolling Update Replication Controller
---
* TOC
{:toc}
## Overview
To update a service without an outage, `kubectl` supports what is called ['rolling update'](/docs/user-guide/kubectl/kubectl_rolling-update), which updates one pod at a time, rather than taking down the entire service at the same time. See the [rolling update design document](https://github.com/kubernetes/kubernetes/blob/{{page.githubbranch}}/docs/design/simple-rolling-update.md) and the [example of rolling update](/docs/tasks/run-application/rolling-update-replication-controller/) for more information.
Note that `kubectl rolling-update` only supports Replication Controllers. However, if you deploy applications with Replication Controllers,
consider switching them to [Deployments](/docs/user-guide/deployments/). A Deployment is a higher-level controller that automates rolling updates
of applications declaratively, and therefore is recommended. If you still want to keep your Replication Controllers and use `kubectl rolling-update`, keep reading:
A rolling update applies changes to the configuration of pods being managed by
a replication controller. The changes can be passed as a new replication
controller configuration file; or, if only updating the image, a new container
image can be specified directly.
A rolling update works by:
1. Creating a new replication controller with the updated configuration.
2. Increasing/decreasing the replica count on the new and old controllers until
the correct number of replicas is reached.
3. Deleting the original replication controller.
Rolling updates are initiated with the `kubectl rolling-update` command:
$ kubectl rolling-update NAME \
([NEW_NAME] --image=IMAGE | -f FILE)
## Passing a configuration file
To initiate a rolling update using a configuration file, pass the new file to
`kubectl rolling-update`:
$ kubectl rolling-update NAME -f FILE
The configuration file must:
* Specify a different `metadata.name` value.
* Overwrite at least one common label in its `spec.selector` field.
* Use the same `metadata.namespace`.
Replication controller configuration files are described in
[Creating Replication Controllers](/docs/user-guide/replication-controller/operations/).
### Examples
// Update pods of frontend-v1 using new replication controller data in frontend-v2.json.
$ kubectl rolling-update frontend-v1 -f frontend-v2.json
// Update pods of frontend-v1 using JSON data passed into stdin.
$ cat frontend-v2.json | kubectl rolling-update frontend-v1 -f -
## Updating the container image
To update only the container image, pass a new image name and tag with the
`--image` flag and (optionally) a new controller name:
$ kubectl rolling-update NAME [NEW_NAME] --image=IMAGE:TAG
The `--image` flag is only supported for single-container pods. Specifying
`--image` with multi-container pods returns an error.
If no `NEW_NAME` is specified, a new replication controller is created with
a temporary name. Once the rollout is complete, the old controller is deleted,
and the new controller is updated to use the original name.
The update will fail if `IMAGE:TAG` is identical to the
current value. For this reason, we recommend the use of versioned tags as
opposed to values such as `:latest`. Doing a rolling update from `image:latest`
to a new `image:latest` will fail, even if the image at that tag has changed.
Moreover, the use of `:latest` is not recommended, see
[Best Practices for Configuration](/docs/concepts/configuration/overview/#container-images) for more information.
### Examples
// Update the pods of frontend-v1 to frontend-v2
$ kubectl rolling-update frontend-v1 frontend-v2 --image=image:v2
// Update the pods of frontend, keeping the replication controller name
$ kubectl rolling-update frontend --image=image:v2
## Required and optional fields
Required fields are:
* `NAME`: The name of the replication controller to update.
as well as either:
* `-f FILE`: A replication controller configuration file, in either JSON or
YAML format. The configuration file must specify a new top-level `id` value
and include at least one of the existing `spec.selector` key:value pairs.
See the
[Run Stateless AP Replication Controller](/docs/tutorials/stateless-application/run-stateless-ap-replication-controller/#replication-controller-configuration-file)
page for details.
<br>
<br>
or:
<br>
<br>
* `--image IMAGE:TAG`: The name and tag of the image to update to. Must be
different than the current image:tag currently specified.
Optional fields are:
* `NEW_NAME`: Only used in conjunction with `--image` (not with `-f FILE`). The
name to assign to the new replication controller.
* `--poll-interval DURATION`: The time between polling the controller status
after update. Valid units are `ns` (nanoseconds), `us` or `µs` (microseconds),
`ms` (milliseconds), `s` (seconds), `m` (minutes), or `h` (hours). Units can
be combined (e.g. `1m30s`). The default is `3s`.
* `--timeout DURATION`: The maximum time to wait for the controller to update a
pod before exiting. Default is `5m0s`. Valid units are as described for
`--poll-interval` above.
* `--update-period DURATION`: The time to wait between updating pods. Default
is `1m0s`. Valid units are as described for `--poll-interval` above.
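Putting these together, an invocation using all the optional flags might look like the following sketch (names and values are illustrative):

```shell
kubectl rolling-update frontend-v1 frontend-v2 \
    --image=image:v2 \
    --update-period=30s \
    --poll-interval=10s \
    --timeout=10m
```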
Additional information about the `kubectl rolling-update` command is available
from the [`kubectl` reference](/docs/user-guide/kubectl/kubectl_rolling-update/).
## Walkthrough
Let's say you were running version 1.7.9 of nginx:
```yaml
apiVersion: v1
kind: ReplicationController
metadata:
name: my-nginx
spec:
replicas: 5
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx:1.7.9
ports:
- containerPort: 80
```
To update to version 1.9.1, you can use [`kubectl rolling-update --image`](https://github.com/kubernetes/kubernetes/blob/{{page.githubbranch}}/docs/design/simple-rolling-update.md) to specify the new image:
```shell
$ kubectl rolling-update my-nginx --image=nginx:1.9.1
Created my-nginx-ccba8fbd8cc8160970f63f9a2696fc46
```
In another window, you can see that `kubectl` added a `deployment` label to the pods, whose value is a hash of the configuration, to distinguish the new pods from the old:
```shell
$ kubectl get pods -l app=nginx -L deployment
NAME READY STATUS RESTARTS AGE DEPLOYMENT
my-nginx-ccba8fbd8cc8160970f63f9a2696fc46-k156z 1/1 Running 0 1m ccba8fbd8cc8160970f63f9a2696fc46
my-nginx-ccba8fbd8cc8160970f63f9a2696fc46-v95yh 1/1 Running 0 35s ccba8fbd8cc8160970f63f9a2696fc46
my-nginx-divi2 1/1 Running 0 2h 2d1d7a8f682934a254002b56404b813e
my-nginx-o0ef1 1/1 Running 0 2h 2d1d7a8f682934a254002b56404b813e
my-nginx-q6all 1/1 Running 0 8m 2d1d7a8f682934a254002b56404b813e
```
`kubectl rolling-update` reports progress as it progresses:
```
Scaling up my-nginx-ccba8fbd8cc8160970f63f9a2696fc46 from 0 to 3, scaling down my-nginx from 3 to 0 (keep 3 pods available, don't exceed 4 pods)
Scaling my-nginx-ccba8fbd8cc8160970f63f9a2696fc46 up to 1
Scaling my-nginx down to 2
Scaling my-nginx-ccba8fbd8cc8160970f63f9a2696fc46 up to 2
Scaling my-nginx down to 1
Scaling my-nginx-ccba8fbd8cc8160970f63f9a2696fc46 up to 3
Scaling my-nginx down to 0
Update succeeded. Deleting old controller: my-nginx
Renaming my-nginx-ccba8fbd8cc8160970f63f9a2696fc46 to my-nginx
replicationcontroller "my-nginx" rolling updated
```
If you encounter a problem, you can stop the rolling update midway and revert to the previous version using `--rollback`:
```shell
$ kubectl rolling-update my-nginx --rollback
Setting "my-nginx" replicas to 1
Continuing update with existing controller my-nginx.
Scaling up nginx from 1 to 1, scaling down my-nginx-ccba8fbd8cc8160970f63f9a2696fc46 from 1 to 0 (keep 1 pods available, don't exceed 2 pods)
Scaling my-nginx-ccba8fbd8cc8160970f63f9a2696fc46 down to 0
Update succeeded. Deleting my-nginx-ccba8fbd8cc8160970f63f9a2696fc46
replicationcontroller "my-nginx" rolling updated
```
This is one example where the immutability of containers is a huge asset.
If you need to update more than just the image (e.g., command arguments, environment variables), you can create a new replication controller, with a new name and distinguishing label value, such as:
```yaml
apiVersion: v1
kind: ReplicationController
metadata:
name: my-nginx-v4
spec:
replicas: 5
selector:
app: nginx
deployment: v4
template:
metadata:
labels:
app: nginx
deployment: v4
spec:
containers:
- name: nginx
image: nginx:1.9.2
args: ["nginx", "-T"]
ports:
- containerPort: 80
```
and roll it out:
```shell
$ kubectl rolling-update my-nginx -f ./nginx-rc.yaml
Created my-nginx-v4
Scaling up my-nginx-v4 from 0 to 5, scaling down my-nginx from 4 to 0 (keep 4 pods available, don't exceed 5 pods)
Scaling my-nginx-v4 up to 1
Scaling my-nginx down to 3
Scaling my-nginx-v4 up to 2
Scaling my-nginx down to 2
Scaling my-nginx-v4 up to 3
Scaling my-nginx down to 1
Scaling my-nginx-v4 up to 4
Scaling my-nginx down to 0
Scaling my-nginx-v4 up to 5
Update succeeded. Deleting old controller: my-nginx
replicationcontroller "my-nginx-v4" rolling updated
```
You can also run the [update demo](/docs/tasks/run-application/rolling-update-replication-controller/) to see a visual representation of the rolling update process.
## Troubleshooting
If the `timeout` duration is reached during a rolling update, the operation will
fail with some pods belonging to the new replication controller, and some to the
original controller.
To continue the update from where it failed, retry using the same command.
To roll back to the original state before the attempted update, append the
`--rollback=true` flag to the original command. This will revert all changes.
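For example, if the image update from the walkthrough above were interrupted, appending the flag to the original command would look like this (a sketch):

```shell
kubectl rolling-update my-nginx --image=nginx:1.9.1 --rollback=true
```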

View File

@ -0,0 +1,393 @@
---
assignees:
- stclair
title: AppArmor
---
AppArmor is a Linux kernel enhancement that can reduce the potential attack surface of an
application and provide greater defense in depth for applications. Beta support for AppArmor was
added in Kubernetes v1.4.
* TOC
{:toc}
## What is AppArmor
AppArmor is a Linux kernel security module that supplements the standard Linux user and group based
permissions to confine programs to a limited set of resources. AppArmor can be configured for any
application to reduce its potential attack surface and provide greater defense in depth. It is
configured through profiles tuned to whitelist the access needed by a specific program or container,
such as Linux capabilities, network access, file permissions, etc. Each profile can be run in either
enforcing mode, which blocks access to disallowed resources, or complain mode, which only reports
violations.
AppArmor can help you to run a more secure deployment by restricting what containers are allowed to
do, and/or by providing better auditing through system logs. However, it is important to keep in mind
that AppArmor is not a silver bullet, and can only do so much to protect against exploits in your
application code. It is important to provide good, restrictive profiles, and harden your
applications and cluster from other angles as well.
AppArmor support in Kubernetes is currently in beta.
## Prerequisites
1. **Kubernetes version is at least v1.4**. Kubernetes support for AppArmor was added in
v1.4. Kubernetes components older than v1.4 are not aware of the new AppArmor annotations, and
will **silently ignore** any AppArmor settings that are provided. To ensure that your Pods are
receiving the expected protections, it is important to verify the Kubelet version of your nodes:
$ kubectl get nodes -o=jsonpath=$'{range .items[*]}{@.metadata.name}: {@.status.nodeInfo.kubeletVersion}\n{end}'
gke-test-default-pool-239f5d02-gyn2: v1.4.0
gke-test-default-pool-239f5d02-x1kf: v1.4.0
gke-test-default-pool-239f5d02-xwux: v1.4.0
2. **AppArmor kernel module is enabled**. For the Linux kernel to enforce an AppArmor profile, the
AppArmor kernel module must be installed and enabled. Several distributions enable the module by
default, such as Ubuntu and SUSE, and many others provide optional support. To check whether the
module is enabled, check the `/sys/module/apparmor/parameters/enabled` file:
$ cat /sys/module/apparmor/parameters/enabled
Y
If the Kubelet contains AppArmor support (>= v1.4), it will refuse to run a Pod with AppArmor
options if the kernel module is not enabled.
*Note: Ubuntu carries many AppArmor patches that have not been merged into the upstream Linux
kernel, including patches that add additional hooks and features. Kubernetes has only been
tested with the upstream version, and does not promise support for other features.*
3. **Container runtime is Docker**. Currently the only Kubernetes-supported container runtime that
also supports AppArmor is Docker. As more runtimes add AppArmor support, the options will be
expanded. You can verify that your nodes are running Docker with:
$ kubectl get nodes -o=jsonpath=$'{range .items[*]}{@.metadata.name}: {@.status.nodeInfo.containerRuntimeVersion}\n{end}'
gke-test-default-pool-239f5d02-gyn2: docker://1.11.2
gke-test-default-pool-239f5d02-x1kf: docker://1.11.2
gke-test-default-pool-239f5d02-xwux: docker://1.11.2
If the Kubelet contains AppArmor support (>= v1.4), it will refuse to run a Pod with AppArmor
options if the runtime is not Docker.
4. **Profile is loaded**. AppArmor is applied to a Pod by specifying an AppArmor profile that each
container should be run with. If any of the specified profiles is not already loaded in the
kernel, the Kubelet (>= v1.4) will reject the Pod. You can view which profiles are loaded on a
node by checking the `/sys/kernel/security/apparmor/profiles` file. For example:
$ ssh gke-test-default-pool-239f5d02-gyn2 "sudo cat /sys/kernel/security/apparmor/profiles | sort"
apparmor-test-deny-write (enforce)
apparmor-test-audit-write (enforce)
docker-default (enforce)
k8s-nginx (enforce)
For more details on loading profiles on nodes, see
[Setting up nodes with profiles](#setting-up-nodes-with-profiles).
As long as the Kubelet version includes AppArmor support (>= v1.4), the Kubelet will reject a Pod
with AppArmor options if any of the prerequisites are not met. You can also verify AppArmor support
on nodes by checking the node ready condition message (though this is likely to be removed in a
later release):
$ kubectl get nodes -o=jsonpath=$'{range .items[*]}{@.metadata.name}: {.status.conditions[?(@.reason=="KubeletReady")].message}\n{end}'
gke-test-default-pool-239f5d02-gyn2: kubelet is posting ready status. AppArmor enabled
gke-test-default-pool-239f5d02-x1kf: kubelet is posting ready status. AppArmor enabled
gke-test-default-pool-239f5d02-xwux: kubelet is posting ready status. AppArmor enabled
## Securing a Pod
*Note: AppArmor is currently in beta, so options are specified as annotations. Once support graduates to
general availability, the annotations will be replaced with first-class fields (more details in
[Upgrade path to GA](#upgrade-path-to-general-availability)).*
AppArmor profiles are specified *per-container*. To specify the AppArmor profile to run a Pod
container with, add an annotation to the Pod's metadata:
container.apparmor.security.beta.kubernetes.io/<container_name>: <profile_ref>
Where `<container_name>` is the name of the container to apply the profile to, and `<profile_ref>`
specifies the profile to apply. The `profile_ref` can be one of:
- `runtime/default` to apply the runtime's default profile.
- `localhost/<profile_name>` to apply the profile loaded on the host with the name `<profile_name>`
See the [API Reference](#api-reference) for the full details on the annotation and profile name formats.
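For example, a Pod that asks for the runtime's default profile on a container named `hello` could be created like the following sketch. This only illustrates the annotation; the pod name `hello-runtime-default` is made up, and the walkthrough below uses a localhost profile instead.

    $ kubectl create -f /dev/stdin <<EOF
    apiVersion: v1
    kind: Pod
    metadata:
      name: hello-runtime-default
      annotations:
        # Ask the runtime (Docker here) for its default AppArmor profile
        container.apparmor.security.beta.kubernetes.io/hello: runtime/default
    spec:
      containers:
      - name: hello
        image: busybox
        command: [ "sh", "-c", "echo 'Hello AppArmor!' && sleep 1h" ]
    EOF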
The Kubernetes AppArmor enforcement works by first checking that all the prerequisites have been
met, and then forwarding the profile selection to the container runtime for enforcement. If the
prerequisites have not been met, the Pod will be rejected, and will not run.
To verify that the profile was applied, you can expect to see the AppArmor security option listed in the container created event:
$ kubectl get events | grep Created
22s 22s 1 hello-apparmor Pod spec.containers{hello} Normal Created {kubelet e2e-test-stclair-minion-group-31nt} Created container with docker id 269a53b202d3; Security:[seccomp=unconfined apparmor=k8s-apparmor-example-deny-write]
You can also verify directly that the container's root process is running with the correct profile by checking its proc attr:
$ kubectl exec <pod_name> cat /proc/1/attr/current
k8s-apparmor-example-deny-write (enforce)
## Example
In this example you'll see:
- One way to load a profile on a node
- How to enforce the profile on a Pod
- How to check that the profile is loaded
- What happens when a profile is violated
- What happens when a profile cannot be loaded
*This example assumes you have already set up a cluster with AppArmor support.*
First, we need to load the profile we want to use onto our nodes. The profile we'll use simply
denies all file writes:
{% include code.html language="text" file="deny-write.profile" ghlink="/docs/tutorials/clusters/deny-write.profile" %}
Since we don't know where the Pod will be scheduled, we'll need to load the profile on all our
nodes. For this example we'll just use SSH to install the profiles, but other approaches are
discussed in [Setting up nodes with profiles](#setting-up-nodes-with-profiles).
$ NODES=(
# The SSH-accessible domain names of your nodes
gke-test-default-pool-239f5d02-gyn2.us-central1-a.my-k8s
gke-test-default-pool-239f5d02-x1kf.us-central1-a.my-k8s
gke-test-default-pool-239f5d02-xwux.us-central1-a.my-k8s)
$ for NODE in ${NODES[*]}; do ssh $NODE 'sudo apparmor_parser -q <<EOF
#include <tunables/global>
profile k8s-apparmor-example-deny-write flags=(attach_disconnected) {
#include <abstractions/base>
file,
# Deny all file writes.
deny /** w,
}
EOF'
done
Next, we'll run a simple "Hello AppArmor" pod with the deny-write profile:
{% include code.html language="yaml" file="hello-apparmor-pod.yaml" ghlink="/docs/tutorials/clusters/hello-apparmor-pod.yaml" %}
```shell
$ kubectl create -f /dev/stdin <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: hello-apparmor
  annotations:
    container.apparmor.security.beta.kubernetes.io/hello: localhost/k8s-apparmor-example-deny-write
spec:
  containers:
  - name: hello
    image: busybox
    command: [ "sh", "-c", "echo 'Hello AppArmor!' && sleep 1h" ]
EOF
pod "hello-apparmor" created
```
If we look at the pod events, we can see that the Pod container was created with the AppArmor
profile "k8s-apparmor-example-deny-write":
```shell
$ kubectl get events | grep hello-apparmor
14s 14s 1 hello-apparmor Pod Normal Scheduled {default-scheduler } Successfully assigned hello-apparmor to gke-test-default-pool-239f5d02-gyn2
14s 14s 1 hello-apparmor Pod spec.containers{hello} Normal Pulling {kubelet gke-test-default-pool-239f5d02-gyn2} pulling image "busybox"
13s 13s 1 hello-apparmor Pod spec.containers{hello} Normal Pulled {kubelet gke-test-default-pool-239f5d02-gyn2} Successfully pulled image "busybox"
13s 13s 1 hello-apparmor Pod spec.containers{hello} Normal Created {kubelet gke-test-default-pool-239f5d02-gyn2} Created container with docker id 06b6cd1c0989; Security:[seccomp=unconfined apparmor=k8s-apparmor-example-deny-write]
13s 13s 1 hello-apparmor Pod spec.containers{hello} Normal Started {kubelet gke-test-default-pool-239f5d02-gyn2} Started container with docker id 06b6cd1c0989
```
We can verify that the container is actually running with that profile by checking its proc attr:
```shell
$ kubectl exec hello-apparmor cat /proc/1/attr/current
k8s-apparmor-example-deny-write (enforce)
```
Finally, we can see what happens if we try to violate the profile by writing to a file:
```shell
$ kubectl exec hello-apparmor touch /tmp/test
touch: /tmp/test: Permission denied
error: error executing remote command: command terminated with non-zero exit code: Error executing in Docker Container: 1
```
To wrap up, let's look at what happens if we try to specify a profile that hasn't been loaded:
```shell
$ kubectl create -f /dev/stdin <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: hello-apparmor-2
  annotations:
    container.apparmor.security.beta.kubernetes.io/hello: localhost/k8s-apparmor-example-allow-write
spec:
  containers:
  - name: hello
    image: busybox
    command: [ "sh", "-c", "echo 'Hello AppArmor!' && sleep 1h" ]
EOF
pod "hello-apparmor-2" created
```
```shell
$ kubectl describe pod hello-apparmor-2
Name:           hello-apparmor-2
Namespace:      default
Node:           gke-test-default-pool-239f5d02-x1kf/
Start Time:     Tue, 30 Aug 2016 17:58:56 -0700
Labels:         <none>
Status:         Failed
Reason:         AppArmor
Message:        Pod Cannot enforce AppArmor: profile "k8s-apparmor-example-allow-write" is not loaded
IP:
Controllers:    <none>
Containers:
  hello:
    Image:      busybox
    Port:
    Command:
      sh
      -c
      echo 'Hello AppArmor!' && sleep 1h
    Requests:
      cpu:      100m
    Environment Variables:      <none>
Volumes:
  default-token-dnz7v:
    Type:       Secret (a volume populated by a Secret)
    SecretName: default-token-dnz7v
QoS Tier:       Burstable
Events:
  FirstSeen  LastSeen  Count  From                                          SubobjectPath  Type     Reason     Message
  ---------  --------  -----  ----                                          -------------  -------- ------     -------
  23s        23s       1      {default-scheduler }                                         Normal   Scheduled  Successfully assigned hello-apparmor-2 to e2e-test-stclair-minion-group-t1f5
  23s        23s       1      {kubelet e2e-test-stclair-minion-group-t1f5}                 Warning  AppArmor   Cannot enforce AppArmor: profile "k8s-apparmor-example-allow-write" is not loaded
```
Note that the pod status is `Failed`, with a helpful error message: `Pod Cannot enforce AppArmor: profile
"k8s-apparmor-example-allow-write" is not loaded`. An event was also recorded with the same message.
## Administration
### Setting up nodes with profiles
Kubernetes does not currently provide any native mechanisms for loading AppArmor profiles onto
nodes. There are many ways to set up the profiles, though, such as:
- Through a [DaemonSet](../daemons/) that runs a Pod on each node to
ensure the correct profiles are loaded. An example implementation can be found
[here](https://github.com/kubernetes/contrib/tree/master/apparmor/loader).
- At node initialization time, using your node initialization scripts (e.g. Salt, Ansible, etc.) or
image.
- By copying the profiles to each node and loading them through SSH, as demonstrated in the
[Example](#example).
The scheduler is not aware of which profiles are loaded onto which node, so the full set of profiles
must be loaded onto every node. An alternative approach is to add a node label for each profile (or
class of profiles) on the node, and use a
[node selector](../../user-guide/node-selection/) to ensure the Pod is run on a
node with the required profile.
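For example, here is a rough sketch of the node-label approach. The label key `apparmor-profiles/deny-write` is invented here purely for illustration; any valid label key works:

```shell
# Label the nodes that have the example profile loaded...
kubectl label node gke-test-default-pool-239f5d02-gyn2 apparmor-profiles/deny-write=loaded

# ...and constrain Pods that need the profile to those nodes with a nodeSelector.
kubectl create -f /dev/stdin <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: hello-apparmor-selector
  annotations:
    container.apparmor.security.beta.kubernetes.io/hello: localhost/k8s-apparmor-example-deny-write
spec:
  nodeSelector:
    apparmor-profiles/deny-write: "loaded"
  containers:
  - name: hello
    image: busybox
    command: [ "sh", "-c", "echo 'Hello AppArmor!' && sleep 1h" ]
EOF
```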
### Restricting profiles with the PodSecurityPolicy
If the PodSecurityPolicy extension is enabled, cluster-wide AppArmor restrictions can be applied. To
enable the PodSecurityPolicy, two flags must be set on the `apiserver`:

```
--admission-control=PodSecurityPolicy[,others...]
--runtime-config=extensions/v1beta1/podsecuritypolicy[,others...]
```
With the extension enabled, the AppArmor options can be specified as annotations on the PodSecurityPolicy:

```yaml
apparmor.security.beta.kubernetes.io/defaultProfileName: <profile_ref>
apparmor.security.beta.kubernetes.io/allowedProfileNames: <profile_ref>[,others...]
```
The default profile name option specifies the profile to apply to containers by default when none is
specified. The allowed profile names option specifies a list of profiles that Pod containers are
allowed to be run with. If both options are provided, the default must be allowed. The profiles are
specified in the same format as on containers. See the [API Reference](#api-reference) for the full
specification.
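For example, assuming a PodSecurityPolicy named `restricted` already exists (the policy name and the allowed-profile list below are only illustrative), the annotations could be added with `kubectl annotate`:

```shell
# Default to the runtime's profile, and only allow the runtime profile or the
# example deny-write profile. Adjust the names to match your cluster.
kubectl annotate podsecuritypolicy restricted \
  apparmor.security.beta.kubernetes.io/defaultProfileName=runtime/default \
  apparmor.security.beta.kubernetes.io/allowedProfileNames=runtime/default,localhost/k8s-apparmor-example-deny-write
```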
### Disabling AppArmor
If you do not want AppArmor to be available on your cluster, it can be disabled by a command-line flag:

```
--feature-gates=AppArmor=false
```
When disabled, any Pod that includes an AppArmor profile will fail validation with a "Forbidden"
error. Note that by default Docker always enables the "docker-default" profile on non-privileged
containers (if the AppArmor kernel module is enabled), and will continue to do so even if the feature
gate is disabled. The option to disable AppArmor will be removed when AppArmor graduates to general
availability (GA).
### Upgrading to Kubernetes v1.4 with AppArmor
No action is required with respect to AppArmor to upgrade your cluster to v1.4. However, if any
existing pods had an AppArmor annotation, they will not go through validation (or PodSecurityPolicy
admission). If permissive profiles are loaded on the nodes, a malicious user could pre-apply a
permissive profile to escalate the pod privileges above the docker-default. If this is a concern, it
is recommended to scrub the cluster of any pods containing an annotation with
`apparmor.security.beta.kubernetes.io`.
### Upgrade path to General Availability
When AppArmor is ready to be graduated to general availability (GA), the options currently specified
through annotations will be converted to fields. Supporting all the upgrade and downgrade paths
through the transition is very nuanced, and will be explained in detail when the transition
occurs. We will commit to supporting both fields and annotations for at least 2 releases, and will
explicitly reject the annotations for at least 2 releases after that.
## Authoring Profiles
Getting AppArmor profiles specified correctly can be a tricky business. Fortunately there are some
tools to help with that:
- `aa-genprof` and `aa-logprof` generate profile rules by monitoring an application's activity and
logs, and admitting the actions it takes. Further instructions are provided by the
[AppArmor documentation](http://wiki.apparmor.net/index.php/Profiling_with_tools).
- [bane](https://github.com/jfrazelle/bane) is an AppArmor profile generator for Docker that uses a
simplified profile language.
It is recommended to run your application through Docker on a development workstation to generate
the profiles, but nothing prevents you from running the tools on the Kubernetes node where your
Pod is running.
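For example, one possible authoring workflow on a development machine (this sketch assumes the `apparmor-utils` package is installed; the binary path and resulting profile file name are placeholders):

```shell
# Record the application's behavior and interactively build a profile for it.
# Exercise the application in another terminal while aa-genprof is running.
sudo aa-genprof /usr/local/bin/my-app

# Review the generated profile under /etc/apparmor.d/, then load (or reload) it
# on each node that may run the Pod.
sudo apparmor_parser -r /etc/apparmor.d/usr.local.bin.my-app
```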
To debug problems with AppArmor, you can check the system logs to see what, specifically, was
denied. AppArmor logs verbose messages to `dmesg`, and errors can usually be found in the system
logs or through `journalctl`. More information is provided in
[AppArmor failures](http://wiki.apparmor.net/index.php/AppArmor_Failures).
Additional resources:
- [Quick guide to the AppArmor profile language](http://wiki.apparmor.net/index.php/QuickProfileLanguage)
- [AppArmor core policy reference](http://wiki.apparmor.net/index.php/ProfileLanguage)
## API Reference
**Pod Annotation**:
Specifying the profile a container will run with:
- **key**: `container.apparmor.security.beta.kubernetes.io/<container_name>`
Where `<container_name>` matches the name of a container in the Pod.
A separate profile can be specified for each container in the Pod.
- **value**: a profile reference, described below
**Profile Reference**:
- `runtime/default`: Refers to the default runtime profile.
- Equivalent to not specifying a profile (without a PodSecurityPolicy default), except it still
requires AppArmor to be enabled.
- For Docker, this resolves to the
[`docker-default`](https://docs.docker.com/engine/security/apparmor/) profile for non-privileged
containers, and unconfined (no profile) for privileged containers.
- `localhost/<profile_name>`: Refers to a profile loaded on the node (localhost) by name.
- The possible profile names are detailed in the
[core policy reference](http://wiki.apparmor.net/index.php/AppArmor_Core_Policy_Reference#Profile_names_and_attachment_specifications)
Any other profile reference format is invalid.
**PodSecurityPolicy Annotations**
Specifying the default profile to apply to containers when none is provided:
- **key**: `apparmor.security.beta.kubernetes.io/defaultProfileName`
- **value**: a profile reference, described above
Specifying the list of profiles that Pod containers are allowed to specify:
- **key**: `apparmor.security.beta.kubernetes.io/allowedProfileNames`
- **value**: a comma-separated list of profile references (described above)
- Although an escaped comma is a legal character in a profile name, it cannot be explicitly
allowed here.
View File
@ -0,0 +1,10 @@
#include <tunables/global>

profile k8s-apparmor-example-deny-write flags=(attach_disconnected) {
  #include <abstractions/base>

  file,

  # Deny all file writes.
  deny /** w,
}
View File
@ -0,0 +1,13 @@
apiVersion: v1
kind: Pod
metadata:
  name: hello-apparmor
  annotations:
    # Tell Kubernetes to apply the AppArmor profile "k8s-apparmor-example-deny-write".
    # Note that this is ignored if the Kubernetes node is not running version 1.4 or greater.
    container.apparmor.security.beta.kubernetes.io/hello: localhost/k8s-apparmor-example-deny-write
spec:
  containers:
  - name: hello
    image: busybox
    command: [ "sh", "-c", "echo 'Hello AppArmor!' && sleep 1h" ]
View File
@ -0,0 +1,208 @@
---
assignees:
- madhusudancs
title: Setting up Cluster Federation with Kubefed
---
* TOC
{:toc}
Kubernetes version 1.5 includes a new command line tool called
`kubefed` to help you administer your federated clusters.
`kubefed` helps you deploy a new Kubernetes cluster federation
control plane, and add clusters to or remove clusters from an
existing federation control plane.
This guide explains how to administer a Kubernetes Cluster Federation
using `kubefed`.
> Note: `kubefed` is an alpha feature in Kubernetes 1.5.
## Prerequisites
This guide assumes that you have a running Kubernetes cluster. Please
see one of the [getting started](/docs/getting-started-guides/) guides
for installation instructions for your platform.
## Getting `kubefed`
Download the client tarball corresponding to Kubernetes version 1.5
or later
[from the release page](https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG.md),
extract the binaries in the tarball to one of the directories
in your `$PATH` and set the executable permission on those binaries.
Note: The URL in the curl command below downloads the binaries for
Linux amd64. If you are on a different platform, please use the URL
for the binaries appropriate for your platform. You can find the list
of available binaries on the [release page](https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG.md#client-binaries-1).
```shell
curl -O https://storage.googleapis.com/kubernetes-release/release/v1.5.2/kubernetes-client-linux-amd64.tar.gz
tar -xzvf kubernetes-client-linux-amd64.tar.gz
sudo cp kubernetes/client/bin/kubefed /usr/local/bin
sudo chmod +x /usr/local/bin/kubefed
sudo cp kubernetes/client/bin/kubectl /usr/local/bin
sudo chmod +x /usr/local/bin/kubectl
```
## Choosing a host cluster
You'll need to choose one of your Kubernetes clusters to be the
*host cluster*. The host cluster hosts the components that make up
your federation control plane. Ensure that you have a `kubeconfig`
entry in your local `kubeconfig` that corresponds to the host cluster.
You can verify that you have the required `kubeconfig` entry by
running:
```shell
kubectl config get-contexts
```
The output should contain an entry corresponding to your host cluster,
similar to the following:
```
CURRENT NAME CLUSTER AUTHINFO NAMESPACE
gke_myproject_asia-east1-b_gce-asia-east1 gke_myproject_asia-east1-b_gce-asia-east1 gke_myproject_asia-east1-b_gce-asia-east1
```
You'll need to provide the `kubeconfig` context (called name in the
entry above) for your host cluster when you deploy your federation
control plane.
## Deploying a federation control plane
To deploy a federation control plane on your host cluster, run the
`kubefed init` command. When you use `kubefed init`, you must provide
the following:
* Federation name
* `--host-cluster-context`, the `kubeconfig` context for the host cluster
* `--dns-zone-name`, a domain name suffix for your federated services
The following example command deploys a federation control plane with
the name `fellowship`, a host cluster context `rivendell`, and the
domain suffix `example.com`:
```shell
kubefed init fellowship --host-cluster-context=rivendell --dns-zone-name="example.com"
```
The domain suffix specified in `--dns-zone-name` must be an existing
domain that you control, and that is programmable by your DNS provider.
`kubefed init` sets up the federation control plane in the host
cluster and also adds an entry for the federation API server in your
local kubeconfig. Note that in the alpha release in Kubernetes 1.5,
`kubefed init` does not automatically set the current context to the
newly deployed federation. You can set the current context manually by
running:
```shell
kubectl config use-context fellowship
```
where `fellowship` is the name of your federation.
## Adding a cluster to a federation
Once you've deployed a federation control plane, you'll need to make
that control plane aware of the clusters it should manage. You can add
a cluster to your federation by using the `kubefed join` command.
To use `kubefed join`, you'll need to provide the name of the cluster
you want to add to the federation, and the `--host-cluster-context`
for the federation control plane's host cluster.
The following example command adds the cluster `gondor` to the
federation with host cluster `rivendell`:
```
kubefed join gondor --host-cluster-context=rivendell
```
> Note: Kubernetes requires that you manually join clusters to a
federation because the federation control plane manages only those
clusters that it is responsible for managing. Adding a cluster tells
the federation control plane that it is responsible for managing that
cluster.
### Naming rules and customization
The cluster name you supply to `kubefed join` must be a valid RFC 1035
label.
Furthermore, the federation control plane requires credentials for the
joined clusters in order to operate on them. These credentials are obtained
from the local kubeconfig. `kubefed join` uses the cluster name
specified as the argument to look for the cluster's context in the
local kubeconfig. If it fails to find a matching context, it exits
with an error.
This might cause issues in cases where context names for each cluster
in the federation don't follow
[RFC 1035](https://www.ietf.org/rfc/rfc1035.txt) label naming rules.
In such cases, you can specify a cluster name that conforms to the
[RFC 1035](https://www.ietf.org/rfc/rfc1035.txt) label naming rules
and specify the cluster context using the `--cluster-context` flag.
For example, if the context of the cluster you are joining is
`gondor_needs-no_king`, then you can join the cluster by running:
```shell
kubefed join gondor --host-cluster-context=rivendell --cluster-context=gondor_needs-no_king
```
#### Secret name
Cluster credentials required by the federation control plane as
described above are stored as a secret in the host cluster. The name
of the secret is also derived from the cluster name.
However, the name of a secret object in Kubernetes should conform
to the DNS subdomain name specification described in
[RFC 1123](https://tools.ietf.org/html/rfc1123). If this isn't the
case, you can pass the secret name to `kubefed join` using the
`--secret-name` flag. For example, if the cluster name is `noldor` and
the secret name is `11kingdom`, you can join the cluster by
running:
```shell
kubefed join noldor --host-cluster-context=rivendell --secret-name=11kingdom
```
Note: If your cluster name does not conform to the DNS subdomain name
specification, all you need to do is supply the secret name via the
`--secret-name` flag. `kubefed join` automatically creates the secret
for you.
## Removing a cluster from a federation
To remove a cluster from a federation, run the `kubefed unjoin`
command with the cluster name and the federation's
`--host-cluster-context`:
```
kubefed unjoin gondor --host-cluster-context=rivendell
```
## Turning down the federation control plane
Proper cleanup of the federation control plane is not fully implemented in
this alpha release of `kubefed`. However, for the time being, deleting
the federation system namespace should remove all the resources except
the persistent storage volume dynamically provisioned for the
federation control plane's etcd. You can delete the federation
namespace by running the following command:
```
$ kubectl delete ns federation-system
```
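If you also want to reclaim the storage that was provisioned for the federation's etcd, a possible follow-up is sketched below; `<pv-name>` stands for whatever volume name `kubectl get pv` reports for the federation etcd claim:

```shell
# List persistent volumes and delete the one that backed the federation etcd.
kubectl get pv
kubectl delete pv <pv-name>
```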
View File
@ -1,5 +1,8 @@
---
title: Declarative Management of Kubernetes Objects Using Configuration Files
redirect_from:
- "/docs/concepts/tools/kubectl/object-management-using-declarative-config/"
- "/docs/concepts/tools/kubectl/object-management-using-declarative-config.html"
---
{% capture overview %}
@ -949,8 +952,8 @@ The recommended approach for ThirdPartyResources is to use [imperative object co
{% endcapture %}
{% capture whatsnext %}
- [Managing Kubernetes Objects Using Imperative Commands](/docs/concepts/tools/kubectl/object-management-using-imperative-commands/)
- [Imperative Management of Kubernetes Objects Using Configuration Files](/docs/concepts/tools/kubectl/object-management-using-imperative-config/)
- [Managing Kubernetes Objects Using Imperative Commands](/docs/tutorials/object-management-kubectl/imperative-object-management-command/)
- [Imperative Management of Kubernetes Objects Using Configuration Files](/docs/tutorials/object-management-kubectl/imperative-object-management-configuration/)
- [Kubectl Command Reference](/docs/user-guide/kubectl/v1.5/)
- [Kubernetes Object Schema Reference](/docs/resources-reference/v1.5/)
{% endcapture %}
View File
@ -1,5 +1,8 @@
---
title: Managing Kubernetes Objects Using Imperative Commands
redirect_from:
- "/docs/concepts/tools/kubectl/object-management-using-imperative-commands/"
- "/docs/concepts/tools/kubectl/object-management-using-imperative-commands.html"
---
{% capture overview %}
@ -150,8 +153,8 @@ kubectl create --edit -f /tmp/srv.yaml
{% endcapture %}
{% capture whatsnext %}
- [Managing Kubernetes Objects Using Object Configuration (Imperative)](/docs/concepts/tools/kubectl/object-management-using-imperative-config/)
- [Managing Kubernetes Objects Using Object Configuration (Declarative)](/docs/concepts/tools/kubectl/object-management-using-declarative-config/)
- [Managing Kubernetes Objects Using Object Configuration (Imperative)](/docs/tutorials/object-management-kubectl/imperative-object-management-configuration/)
- [Managing Kubernetes Objects Using Object Configuration (Declarative)](/docs/tutorials/object-management-kubectl/declarative-object-management-configuration/)
- [Kubectl Command Reference](/docs/user-guide/kubectl/v1.5/)
- [Kubernetes Object Schema Reference](/docs/resources-reference/v1.5/)
{% endcapture %}

View File
---
title: Imperative Management of Kubernetes Objects Using Configuration Files
redirect_from:
- "/docs/concepts/tools/kubectl/object-management-using-imperative-config/"
- "/docs/concepts/tools/kubectl/object-management-using-imperative-config.html"
---
{% capture overview %}
@ -120,8 +123,8 @@ template:
{% endcapture %}
{% capture whatsnext %}
- [Managing Kubernetes Objects Using Imperative Commands](/docs/concepts/tools/kubectl/object-management-using-imperative-commands/)
- [Managing Kubernetes Objects Using Object Configuration (Declarative)](/docs/concepts/tools/kubectl/object-management-using-declarative-config/)
- [Managing Kubernetes Objects Using Imperative Commands](/docs/tutorials/object-management-kubectl/imperative-object-management-command/)
- [Managing Kubernetes Objects Using Object Configuration (Declarative)](/docs/tutorials/object-management-kubectl/declarative-object-management-configuration/)
- [Kubectl Command Reference](/docs/user-guide/kubectl/v1.5/)
- [Kubernetes Object Schema Reference](/docs/resources-reference/v1.5/)
{% endcapture %}
View File
@ -1,5 +1,8 @@
---
title: Kubernetes Object Management
redirect_from:
- "/docs/concepts/tools/kubectl/object-management-overview/"
- "/docs/concepts/tools/kubectl/object-management-overview.html"
---
{% capture overview %}
@ -162,9 +165,9 @@ Disadvantages compared to imperative object configuration:
{% endcapture %}
{% capture whatsnext %}
- [Managing Kubernetes Objects Using Imperative Commands](/docs/concepts/tools/kubectl/object-management-using-imperative-commands/)
- [Managing Kubernetes Objects Using Object Configuration (Imperative)](/docs/concepts/tools/kubectl/object-management-using-imperative-config/)
- [Managing Kubernetes Objects Using Object Configuration (Declarative)](/docs/concepts/tools/kubectl/object-management-using-declarative-config/)
- [Managing Kubernetes Objects Using Imperative Commands](/docs/tutorials/object-management-kubectl/imperative-object-management-command/)
- [Managing Kubernetes Objects Using Object Configuration (Imperative)](/docs/tutorials/object-management-kubectl/imperative-object-management-configuration/)
- [Managing Kubernetes Objects Using Object Configuration (Declarative)](/docs/tutorials/object-management-kubectl/declarative-object-management-configuration/)
- [Kubectl Command Reference](/docs/user-guide/kubectl/v1.5/)
- [Kubernetes Object Schema Reference](/docs/resources-reference/v1.5/)
View File
@ -0,0 +1,258 @@
---
assignees:
- bprashanth
title: Run Stateless AP Replication Controller
---
* TOC
{:toc}
A replication controller ensures that a specified number of pod "replicas" are
running at any one time. If there are too many, it will kill some. If there are
too few, it will start more.
## Creating a replication controller
Replication controllers are created with `kubectl create`:
```shell
$ kubectl create -f FILE
```
Where:
* `-f FILE` or `--filename FILE` is a relative path to a
[configuration file](#replication_controller_configuration_file) in
either JSON or YAML format.
You can use the [sample file](#sample_file) below to try a create request.
A successful create request returns the name of the replication controller. To
view more details about the controller, see
[Viewing replication controllers](#viewing_replication_controllers) below.
### Replication controller configuration file
When creating a replication controller, you must point to a configuration file
as the value of the `-f` flag. The configuration
file can be formatted as YAML or as JSON, and supports the following fields:
```json
{
"apiVersion": "v1",
"kind": "ReplicationController",
"metadata": {
"name": "",
"labels": "",
"namespace": ""
},
"spec": {
"replicas": int,
"selector": {
"":""
},
"template": {
"metadata": {
"labels": {
"":""
}
},
"spec": {
// See 'The spec schema' below
}
}
}
}
```
Required fields are:
* `kind`: Always `ReplicationController`.
* `apiVersion`: Currently `v1`.
* `metadata`: An object containing:
* `name`: Required if `generateName` is not specified. The name of this
replication controller. It must be an
[RFC1035](https://www.ietf.org/rfc/rfc1035.txt) compatible value and be
unique within the namespace.
* `labels`: Optional. Labels are arbitrary key:value pairs that can be used
for grouping and targeting by other resources and services.
* `generateName`: Required if `name` is not set. A prefix to use to generate
a unique name. Has the same validation rules as `name`.
* `namespace`: Optional. The namespace of the replication controller.
* `annotations`: Optional. A map of string keys and values that can be used
by external tooling to store and retrieve arbitrary metadata about
objects.
* `spec`: The configuration for this replication controller. It must
contain:
* `replicas`: The number of pods to create and maintain.
* `selector`: A map of key:value pairs assigned to the set of pods that
this replication controller is responsible for managing. **This must**
**match the key:value pairs in the `template`'s `labels` field**.
* `template` contains:
* A `metadata` object with `labels` for the pod.
* The [`spec` schema](#the_spec_schema) that defines the pod
configuration.
### The `spec` schema
The `spec` schema (that is a child of `template`) is described in the locations
below:
* The [`spec` schema](/docs/user-guide/pods/multi-container/#the_spec_schema)
section of the Creating Multi-Container Pods page covers required and
frequently-used fields.
* The entire `spec` schema is documented in the
[Kubernetes API reference](/docs/api-reference/v1/definitions/#_v1_podspec).
### Sample file
The following sample file creates 2 pods, each containing a single container
using the `redis` image. Port 80 on each container is opened. The replication
controller carries the label `state: serving`. The pods are given the label
`app: frontend`, and the `selector` is set to `app: frontend`, to indicate that
the controller should manage pods with that label.
```json
{
"kind": "ReplicationController",
"apiVersion": "v1",
"metadata": {
"name": "frontend-controller",
"labels": {
"state": "serving"
}
},
"spec": {
"replicas": 2,
"selector": {
"app": "frontend"
},
"template": {
"metadata": {
"labels": {
"app": "frontend"
}
},
"spec": {
"volumes": null,
"containers": [
{
"name": "php-redis",
"image": "redis",
"ports": [
{
"containerPort": 80,
"protocol": "TCP"
}
],
"imagePullPolicy": "IfNotPresent"
}
],
"restartPolicy": "Always",
"dnsPolicy": "ClusterFirst"
}
}
}
}
```
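For example, if the sample above is saved as `frontend-controller.json` (a file name chosen here only for illustration), the controller can be created with the command below; the exact output format can vary by kubectl version:

```shell
$ kubectl create -f frontend-controller.json
replicationcontroller "frontend-controller" created
```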
## Updating replication controller pods
See [Rolling Updates](/docs/tasks/run-application/rolling-update-replication-controller/).
## Resizing a replication controller
To increase or decrease the number of pods under a replication controller's
control, use the `kubectl scale` command:

```shell
$ kubectl scale rc NAME --replicas=COUNT \
    [--current-replicas=COUNT] \
    [--resource-version=VERSION]
```
Tip: You can use the `rc` alias in your commands in place of
`replicationcontroller`.
Required fields are:
* `NAME`: The name of the replication controller to update.
* `--replicas=COUNT`: The desired number of replicas.
Optional fields are:
* `--current-replicas=COUNT`: A precondition for current size. If specified,
the resize will only take place if the current number of replicas matches
this value.
* `--resource-version=VERSION`: A precondition for resource version. If
specified, the resize will only take place if the current replication
controller version matches this value. Versions are specified in the
`labels` field of the replication controller's configuration file, as a
key:value pair with a key of `version`. For example,
`--resource-version test` matches:

```json
"labels": {
  "version": "test"
}
```
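For example, a minimal sketch that scales the sample controller above from 2 to 3 replicas, using `--current-replicas` as a safety precondition (the output shown is typical, but may vary by kubectl version):

```shell
$ kubectl scale rc frontend-controller --replicas=3 --current-replicas=2
replicationcontroller "frontend-controller" scaled
```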
## Viewing replication controllers
To list replication controllers on a cluster, use the `kubectl get` command:
```shell
$ kubectl get rc
```
A successful get command returns all replication controllers on the cluster in
the specified or default namespace. For example:
```shell
CONTROLLER CONTAINER(S) IMAGE(S) SELECTOR REPLICAS
frontend php-redis redis name=frontend 2
```
You can also use `get rc NAME` to return information about a specific
replication controller.
To view detailed information about a specific replication controller, use the
`kubectl describe` command:
```shell
$ kubectl describe rc NAME
```
A successful describe request returns details about the replication controller
including number and status of pods managed, and recent events:
```conf
Name: frontend
Namespace: default
Image(s): gcr.io/google_samples/gb-frontend:v3
Selector: name=frontend
Labels: name=frontend
Replicas: 2 current / 2 desired
Pods Status: 2 Running / 0 Waiting / 0 Succeeded / 0 Failed
Events:
FirstSeen LastSeen Count From SubobjectPath Reason Message
Fri, 06 Nov 2015 16:52:50 -0800 Fri, 06 Nov 2015 16:52:50 -0800 1 {replication-controller } SuccessfulCreate Created pod: frontend-gyx2h
Fri, 06 Nov 2015 16:52:50 -0800 Fri, 06 Nov 2015 16:52:50 -0800 1 {replication-controller } SuccessfulCreate Created pod: frontend-vc9w4
```
## Deleting replication controllers
To delete a replication controller as well as the pods that it controls, use
`kubectl delete`:
```shell
$ kubectl delete rc NAME
```
By default, `kubectl delete rc` will resize the controller to zero (effectively
deleting all pods) before deleting it.
To delete a replication controller without deleting its pods, use
`kubectl delete` and specify `--cascade=false`:
```shell
$ kubectl delete rc NAME --cascade=false
```
A successful delete request returns the name of the deleted resource.
View File
@ -1,119 +1,7 @@
---
assignees:
- mikedanese
title: Best Practices for Configuration
---
This document highlights and consolidates in one place the configuration best practices that are introduced throughout the user guide, getting-started documentation, and examples. This is a living document, so if you think of something that is not on this list but might be useful to others, please don't hesitate to file an issue or submit a PR.
## General Config Tips
- When defining configurations, specify the latest stable API version (currently v1).
- Configuration files should be stored in version control before being pushed to the cluster. This allows a configuration to be quickly rolled back if needed, and will aid with cluster re-creation and restoration if necessary.
- Write your configuration files using YAML rather than JSON. They can be used interchangeably in almost all scenarios, but YAML tends to be more user-friendly for config.
- Group related objects together in a single file where this makes sense. This format is often easier to manage than separate files. See the [guestbook-all-in-one.yaml](https://github.com/kubernetes/kubernetes/tree/{{page.githubbranch}}/examples/guestbook/all-in-one/guestbook-all-in-one.yaml) file as an example of this syntax.
(Note also that many `kubectl` commands can be called on a directory, and so you can also call
`kubectl create` on a directory of config files; see below for more detail).
- Don't specify default values unnecessarily, in order to simplify and minimize configs, and to
reduce errors. For example, omit the selector and labels in a `ReplicationController` if you want
them to be the same as the labels in its `podTemplate`, since those fields are populated from the
`podTemplate` labels by default. See the [guestbook app's](https://github.com/kubernetes/kubernetes/tree/{{page.githubbranch}}/examples/guestbook/) .yaml files for some [examples](https://github.com/kubernetes/kubernetes/tree/{{page.githubbranch}}/examples/guestbook/frontend-deployment.yaml) of this.
- Put an object description in an annotation to allow better introspection.
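For instance, here is a sketch of the last tip; the controller name and description text are only examples:

```shell
# Attach a free-form description to an object for later introspection by tooling.
kubectl annotate rc my-rc description="Serves the guestbook front end; owned by the web team"
```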
## "Naked" Pods vs Replication Controllers and Jobs
- If there is a viable alternative to naked pods (i.e., pods not bound to a [replication controller
](/docs/user-guide/replication-controller)), go with the alternative. Naked pods will not be rescheduled in the
event of node failure.
Replication controllers are almost always preferable to creating pods, except for some explicit
[`restartPolicy: Never`](/docs/user-guide/pod-states/#restartpolicy) scenarios. A
[Job](/docs/user-guide/jobs/) object (currently in Beta) may also be appropriate.
## Services
- It's typically best to create a [service](/docs/user-guide/services/) before corresponding [replication
controllers](/docs/user-guide/replication-controller/), so that the scheduler can spread the pods comprising the
service. You can also create a replication controller without specifying replicas (this will set
replicas=1), create a service, then scale up the replication controller. This can be useful in
ensuring that one replica works before creating lots of them.
- Don't use `hostPort` (which specifies the port number to expose on the host) unless absolutely
necessary, e.g., for a node daemon. When you bind a Pod to a `hostPort`, there are a limited
number of places that pod can be scheduled, due to port conflicts; you can only schedule as many
such Pods as there are nodes in your Kubernetes cluster.
If you only need access to the port for debugging purposes, you can use the [kubectl proxy and apiserver proxy](/docs/user-guide/connecting-to-applications-proxy/) or [kubectl port-forward](/docs/user-guide/connecting-to-applications-port-forward/).
You can use a [Service](/docs/user-guide/services/) object for external service access.
If you do need to expose a pod's port on the host machine, consider using a [NodePort](/docs/user-guide/services/#type-nodeport) service before resorting to `hostPort`.
- Avoid using `hostNetwork`, for the same reasons as `hostPort`.
- Use _headless services_ for easy service discovery when you don't need kube-proxy load balancing.
See [headless services](/docs/user-guide/services/#headless-services).
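A minimal headless-service sketch follows; the service name, selector, and port are illustrative only:

```shell
kubectl create -f /dev/stdin <<EOF
apiVersion: v1
kind: Service
metadata:
  name: myapp-headless
spec:
  clusterIP: None        # headless: no virtual IP and no kube-proxy load balancing
  selector:
    app: myapp
  ports:
  - port: 80
EOF
```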
## Using Labels
- Define and use [labels](/docs/user-guide/labels/) that identify __semantic attributes__ of your application or
deployment. For example, instead of attaching a label to a set of pods to explicitly represent
some service (e.g., `service: myservice`), or explicitly representing the replication
controller managing the pods (e.g., `controller: mycontroller`), attach labels that identify
semantic attributes, such as `{ app: myapp, tier: frontend, phase: test, deployment: v3 }`. This
will let you select the object groups appropriate to the context, e.g., a service for all "tier:
frontend" pods, or all "test" phase components of app "myapp". See the
[guestbook](https://github.com/kubernetes/kubernetes/tree/{{page.githubbranch}}/examples/guestbook/) app for an example of this approach.
A service can be made to span multiple deployments, such as is done across [rolling updates](/docs/user-guide/kubectl/kubectl_rolling-update/), by simply omitting release-specific labels from its selector, rather than updating a service's selector to match the replication controller's selector fully.
- To facilitate rolling updates, include version info in replication controller names, e.g. as a
suffix to the name. It is useful to set a 'version' label as well. The rolling update creates a
new controller as opposed to modifying the existing controller, so version-agnostic controller
names will cause problems. See the [documentation](/docs/user-guide/kubectl/kubectl_rolling-update/) on
the rolling-update command for more detail.
Note that the [Deployment](/docs/user-guide/deployments/) object obviates the need to manage replication
controller 'version names'. A desired state of an object is described by a Deployment, and if
changes to that spec are _applied_, the deployment controller changes the actual state to the
desired state at a controlled rate. (Deployment objects are currently part of the [`extensions`
API Group](/docs/api/#api-groups).)
- You can manipulate labels for debugging. Because Kubernetes replication controllers and services
match to pods using labels, this allows you to remove a pod from being considered by a
controller, or served traffic by a service, by removing the relevant selector labels. If you
remove the labels of an existing pod, its controller will create a new pod to take its place.
This is a useful way to debug a previously "live" pod in a quarantine environment. See the
[`kubectl label`](/docs/user-guide/kubectl/kubectl_label/) command.
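Here is a sketch of the quarantine-by-relabeling technique described in the last bullet; the pod name and label key are placeholders:

```shell
# Removing the selector label detaches the pod from its controller and service;
# the controller then creates a replacement pod, leaving this one for debugging.
kubectl label pod my-pod-12345 app-
kubectl label pod my-pod-12345 debug=true
```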
## Container Images
- The [default container image pull policy](/docs/user-guide/images/) is `IfNotPresent`, which causes the
[Kubelet](/docs/admin/kubelet/) to not pull an image if it already exists. If you would like to
always force a pull, you must specify a pull image policy of `Always` in your .yaml file
(`imagePullPolicy: Always`) or specify a `:latest` tag on your image.
That is, if you're specifying an image with other than the `:latest` tag, e.g. `myimage:v1`, and
there is an image update to that same tag, the Kubelet won't pull the updated image. You can
address this by ensuring that any updates to an image bump the image tag as well (e.g.
`myimage:v2`), and ensuring that your configs point to the correct version.
**Note:** you should avoid using the `:latest` tag when deploying containers in production, because it makes it hard
to track which version of the image is running and hard to roll back.
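A small sketch of these image tips follows; the image name is a stand-in:

```shell
kubectl create -f /dev/stdin <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: pull-policy-example
spec:
  containers:
  - name: app
    image: myimage:v2          # pin a versioned tag instead of :latest
    imagePullPolicy: Always    # force a pull even if the image is cached on the node
EOF
```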
## Using kubectl
- Use `kubectl create -f <directory>` where possible. This looks for config objects in all `.yaml`, `.yml`, and `.json` files in `<directory>` and passes them to `create`.
- Use `kubectl delete` rather than `stop`. `Delete` has a superset of the functionality of `stop`, and `stop` is deprecated.
- Use kubectl bulk operations (via files and/or labels) for get and delete. See [label selectors](/docs/user-guide/labels/#label-selectors) and [using labels effectively](/docs/user-guide/managing-deployments/#using-labels-effectively).
- Use `kubectl run` and `expose` to quickly create and expose single container Deployments. See the [quick start guide](/docs/user-guide/quick-start/) for an example.
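For example, sketches of the bulk-operation tips above; the directory path and label values are illustrative:

```shell
# Create every config file found in a directory.
kubectl create -f ./configs/

# Bulk get and delete by label selector.
kubectl get pods -l app=myapp
kubectl delete deployment,service -l app=myapp
```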
{% include user-guide-content-moved.md %}
[Configuration Overview](/docs/concepts/configuration/overview/)