Initial checkin of v1.1 -- does not build
20
_config.yml
@@ -22,24 +22,4 @@ defaults:
    values:
      version: "v1.1"
      versionfilesafe: "v1_1"
  -
    scope:
      path: "v1.1/reference"
    values:
      section: "reference"
  -
    scope:
      path: "v1.1/guides"
    values:
      section: "guides"
  -
    scope:
      path: "v1.1/support"
    values:
      section: "support"
  -
    scope:
      path: "v1.1/samples"
    values:
      section: "samples"
permalink: pretty
@@ -0,0 +1,19 @@
---
layout: docwithnav
title: "Kubernetes API Reference"
---

## {{ page.title }} ##

Use these reference documents to learn how to interact with Kubernetes through the REST API.

You can also view details about the *Extensions API*. For more about extensions, see [API versioning](docs/api.html).

<p>Table of Contents:</p>
<ul id="toclist"></ul>

<script>
$(function() {
  $('#toclist').load( location.pathname + " #gentocapiref li" );
});
</script>
@@ -0,0 +1,17 @@
---
layout: docwithnav
title: "Application Administration: Detailed Walkthrough"
---

## {{ page.title }} ##

The detailed walkthrough covers all the in-depth details and tasks for administering your applications in Kubernetes.

<p>Table of Contents:</p>
<ul id="toclist"></ul>

<script>
$(function() {
  $('#toclist').load( location.pathname + " #gentocappadmin li" );
});
</script>
@@ -0,0 +1,17 @@
---
layout: docwithnav
title: "Quick Walkthrough: Kubernetes Basics"
---

## {{ page.title }} ##

Use this quick walkthrough of Kubernetes to learn about the basic application administration tasks.

<p>Table of Contents:</p>
<ul id="toclist"></ul>

<script>
$(function() {
  $('#toclist').load( location.pathname + " #gentocbasictut li" );
});
</script>
@@ -0,0 +1,17 @@
---
layout: docwithnav
title: "Examples: Deploying Clusters"
---

## {{ page.title }} ##

Use the following examples to learn how to deploy your application into a Kubernetes cluster.

<p>Table of Contents:</p>
<ul id="toclist"></ul>

<script>
$(function() {
  $('#toclist').load( location.pathname + " #gentocdplyclst li" );
});
</script>
@@ -0,0 +1,49 @@
---
layout: docwithnav
title: "Kubernetes Documentation: releases.k8s.io/release-1.1"
---
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->


<!-- END MUNGE: UNVERSIONED_WARNING -->

# Kubernetes Documentation: releases.k8s.io/release-1.1

* The [User's guide](user-guide/README.html) is for anyone who wants to run programs and
  services on an existing Kubernetes cluster.

* The [Cluster Admin's guide](admin/README.html) is for anyone setting up
  a Kubernetes cluster or administering it.

* The [Developer guide](devel/README.html) is for anyone wanting to write
  programs that access the Kubernetes API, write plugins or extensions, or
  modify the core code of Kubernetes.

* The [Kubectl Command Line Interface](user-guide/kubectl/kubectl.html) is a detailed reference on
  the `kubectl` CLI.

* The [API object documentation](http://kubernetes.io/third_party/swagger-ui/)
  is a detailed description of all fields found in core API objects.

* An overview of the [Design of Kubernetes](design/)

* There are example files and walkthroughs in the [examples](../examples/)
  folder.

* If something went wrong, see the [troubleshooting](troubleshooting.html) document for how to debug.
  You should also check the [known issues](user-guide/known-issues.html) for the release you're using.

* To report a security issue, see [Reporting a Security Issue](reporting-security-issues.html).


<!-- BEGIN MUNGE: IS_VERSIONED -->
<!-- TAG IS_VERSIONED -->
<!-- END MUNGE: IS_VERSIONED -->


<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/README.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->
@@ -0,0 +1,58 @@
---
layout: docwithnav
title: "Kubernetes Cluster Admin Guide"
---
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->


<!-- END MUNGE: UNVERSIONED_WARNING -->

# Kubernetes Cluster Admin Guide

The cluster admin guide is for anyone creating or administering a Kubernetes cluster.
It assumes some familiarity with concepts in the [User Guide](../user-guide/README.html).

## Admin Guide Table of Contents

[Introduction](introduction.html)

1. [Components of a cluster](cluster-components.html)
1. [Cluster Management](cluster-management.html)
1. Administrating Master Components
   1. [The kube-apiserver binary](kube-apiserver.html)
   1. [Authorization](authorization.html)
   1. [Authentication](authentication.html)
   1. [Accessing the api](accessing-the-api.html)
   1. [Admission Controllers](admission-controllers.html)
   1. [Administrating Service Accounts](service-accounts-admin.html)
   1. [Resource Quotas](resource-quota.html)
   1. [The kube-scheduler binary](kube-scheduler.html)
   1. [The kube-controller-manager binary](kube-controller-manager.html)
1. [Administrating Kubernetes Nodes](node.html)
   1. [The kubelet binary](kubelet.html)
   1. [Garbage Collection](garbage-collection.html)
   1. [The kube-proxy binary](kube-proxy.html)
1. Administrating Addons
   1. [DNS](dns.html)
1. [Networking](networking.html)
   1. [OVS Networking](ovs-networking.html)
1. Example Configurations
   1. [Multiple Clusters](multi-cluster.html)
   1. [High Availability Clusters](high-availability.html)
   1. [Large Clusters](cluster-large.html)
1. [Getting started from scratch](../getting-started-guides/scratch.html)
1. [Kubernetes's use of salt](salt.html)
1. [Troubleshooting](cluster-troubleshooting.html)


<!-- BEGIN MUNGE: IS_VERSIONED -->
<!-- TAG IS_VERSIONED -->
<!-- END MUNGE: IS_VERSIONED -->


<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/admin/README.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->
@@ -0,0 +1,91 @@
---
layout: docwithnav
title: "Configuring APIserver ports"
---
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->


<!-- END MUNGE: UNVERSIONED_WARNING -->

# Configuring APIserver ports

This document describes what ports the Kubernetes apiserver
may serve on and how to reach them. The audience is
cluster administrators who want to customize their cluster
or understand the details.

Most questions about accessing the cluster are covered
in [Accessing the cluster](../user-guide/accessing-the-cluster.html).


## Ports and IPs Served On

The Kubernetes API is served by the Kubernetes apiserver process. Typically,
there is one of these running on a single kubernetes-master node.

By default, the Kubernetes APIserver serves HTTP on 2 ports:

1. Localhost Port
   - serves HTTP
   - default is port 8080, change with `--insecure-port` flag.
   - default IP is localhost, change with `--insecure-bind-address` flag.
   - no authentication or authorization checks in HTTP
   - protected by the need to have host access
2. Secure Port
   - default is port 6443, change with `--secure-port` flag.
   - default IP is the first non-localhost network interface, change with `--bind-address` flag.
   - serves HTTPS. Set cert with `--tls-cert-file` and key with `--tls-private-key-file` flag.
   - uses token-file or client-certificate based [authentication](authentication.html).
   - uses policy-based [authorization](authorization.html).
3. Removed: ReadOnly Port
   - For security reasons, this had to be removed. Use the [service account](../user-guide/service-accounts.html) feature instead.
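
As a rough illustration of how the flags above fit together, an apiserver command line might include the following. This is a sketch only; the file paths and addresses are assumptions, not values taken from any particular deployment, and other required apiserver flags (etcd, service cluster IP range, and so on) are omitted.

{% highlight console %}
{% raw %}
kube-apiserver \
  --insecure-bind-address=127.0.0.1 --insecure-port=8080 \
  --bind-address=0.0.0.0 --secure-port=6443 \
  --tls-cert-file=/srv/kubernetes/server.cert \
  --tls-private-key-file=/srv/kubernetes/server.key
{% endraw %}
{% endhighlight %}
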
## Proxies and Firewall rules

Additionally, in some configurations there is a proxy (nginx) running
on the same machine as the apiserver process. The proxy serves HTTPS protected
by Basic Auth on port 443, and proxies to the apiserver on localhost:8080. In
these configurations the secure port is typically set to 6443.

A firewall rule is typically configured to allow external HTTPS access to port 443.
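
For instance, on GCE such a rule could be created with the sketch below. The rule name and target tag are assumptions for illustration; the setup scripts normally create an equivalent rule for you.

{% highlight console %}
{% raw %}
gcloud compute firewall-rules create kubernetes-master-https \
  --allow tcp:443 --target-tags kubernetes-master
{% endraw %}
{% endhighlight %}
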
The above are defaults and reflect how Kubernetes is deployed to Google Compute Engine using
kube-up.sh. Other cloud providers may vary.

## Use Cases vs IP:Ports

There are three differently configured serving ports because there are a
variety of use cases:
1. Clients outside of a Kubernetes cluster, such as a human running `kubectl`
   on a desktop machine. Currently, these access the Localhost Port via a proxy (nginx)
   running on the `kubernetes-master` machine. The proxy can use cert-based authentication
   or token-based authentication.
2. Processes running in Containers on Kubernetes that need to read from
   the apiserver. Currently, these can use a [service account](../user-guide/service-accounts.html).
3. Scheduler and Controller-manager processes, which need to do read-write
   API operations. Currently, these have to run on the same host as the
   apiserver and use the Localhost Port. In the future, these will be
   switched to using service accounts to avoid the need to be co-located.
4. Kubelets, which need to do read-write API operations and are necessarily
   on different machines than the apiserver. Kubelets use the Secure Port
   to get their pods, to find the services that a pod can see, and to
   write events. Credentials are distributed to kubelets at cluster
   setup time. Kubelets and kube-proxy can use cert-based authentication or token-based
   authentication.

## Expected changes

- Policy will limit the actions kubelets can do via the authed port.
- Scheduler and Controller-manager will use the Secure Port too. They
  will then be able to run on different machines than the apiserver.


<!-- BEGIN MUNGE: IS_VERSIONED -->
<!-- TAG IS_VERSIONED -->
<!-- END MUNGE: IS_VERSIONED -->


<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/admin/accessing-the-api.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->
@@ -0,0 +1,177 @@
---
layout: docwithnav
title: "Admission Controllers"
---
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->


<!-- END MUNGE: UNVERSIONED_WARNING -->

# Admission Controllers

**Table of Contents**
<!-- BEGIN MUNGE: GENERATED_TOC -->

- [Admission Controllers](#admission-controllers)
  - [What are they?](#what-are-they)
  - [Why do I need them?](#why-do-i-need-them)
  - [How do I turn on an admission control plug-in?](#how-do-i-turn-on-an-admission-control-plug-in)
  - [What does each plug-in do?](#what-does-each-plug-in-do)
    - [AlwaysAdmit](#alwaysadmit)
    - [AlwaysDeny](#alwaysdeny)
    - [DenyExecOnPrivileged (deprecated)](#denyexeconprivileged-deprecated)
    - [DenyEscalatingExec](#denyescalatingexec)
    - [ServiceAccount](#serviceaccount)
    - [SecurityContextDeny](#securitycontextdeny)
    - [ResourceQuota](#resourcequota)
    - [LimitRanger](#limitranger)
    - [InitialResources (experimental)](#initialresources-experimental)
    - [NamespaceExists (deprecated)](#namespaceexists-deprecated)
    - [NamespaceAutoProvision (deprecated)](#namespaceautoprovision-deprecated)
    - [NamespaceLifecycle](#namespacelifecycle)
  - [Is there a recommended set of plug-ins to use?](#is-there-a-recommended-set-of-plug-ins-to-use)

<!-- END MUNGE: GENERATED_TOC -->

## What are they?

An admission control plug-in is a piece of code that intercepts requests to the Kubernetes
API server prior to persistence of the object, but after the request is authenticated
and authorized. The plug-in code is in the API server process
and must be compiled into the binary in order to be used at this time.

Each admission control plug-in is run in sequence before a request is accepted into the cluster. If
any of the plug-ins in the sequence reject the request, the entire request is rejected immediately
and an error is returned to the end-user.

Admission control plug-ins may mutate the incoming object in some cases to apply system configured
defaults. In addition, admission control plug-ins may mutate related resources as part of request
processing to do things like increment quota usage.

## Why do I need them?

Many advanced features in Kubernetes require an admission control plug-in to be enabled in order
to properly support the feature. As a result, a Kubernetes API server that is not properly
configured with the right set of admission control plug-ins is an incomplete server and will not
support all the features you expect.

## How do I turn on an admission control plug-in?

The Kubernetes API server supports a flag, `admission-control`, that takes a comma-delimited,
ordered list of admission control choices to invoke prior to modifying objects in the cluster.
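
For example, an illustrative fragment only (other apiserver flags are omitted here; the full recommended plug-in set is given at the end of this document):

{% highlight console %}
{% raw %}
# enable two of the plug-ins described below, in this order
kube-apiserver --admission-control=NamespaceLifecycle,ServiceAccount
{% endraw %}
{% endhighlight %}
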
## What does each plug-in do?

### AlwaysAdmit

Use this plug-in by itself to pass through all requests.

### AlwaysDeny

Rejects all requests. Used for testing.

### DenyExecOnPrivileged (deprecated)

This plug-in will intercept all requests to exec a command in a pod if that pod has a privileged container.

If your cluster supports privileged containers, and you want to restrict the ability of end-users to exec
commands in those containers, we strongly encourage enabling this plug-in.

This functionality has been merged into [DenyEscalatingExec](#denyescalatingexec).

### DenyEscalatingExec

This plug-in will deny exec and attach commands to pods that run with escalated privileges that
allow host access. This includes pods that run as privileged, have access to the host IPC namespace, and
have access to the host PID namespace.

If your cluster supports containers that run with escalated privileges, and you want to
restrict the ability of end-users to exec commands in those containers, we strongly encourage
enabling this plug-in.

### ServiceAccount

This plug-in implements automation for [serviceAccounts](../user-guide/service-accounts.html).
We strongly recommend using this plug-in if you intend to make use of Kubernetes `ServiceAccount` objects.

### SecurityContextDeny

This plug-in will deny any pod with a [SecurityContext](../user-guide/security-context.html) that defines options that were not available on the `Container`.

### ResourceQuota

This plug-in will observe the incoming request and ensure that it does not violate any of the constraints
enumerated in the `ResourceQuota` object in a `Namespace`. If you are using `ResourceQuota`
objects in your Kubernetes deployment, you MUST use this plug-in to enforce quota constraints.

See the [resourceQuota design doc](../design/admission_control_resource_quota.html) and the [example of Resource Quota](resourcequota/) for more details.

It is strongly encouraged that this plug-in is configured last in the sequence of admission control plug-ins. This is
so that quota is not prematurely incremented only for the request to be rejected later in admission control.

### LimitRanger

This plug-in will observe the incoming request and ensure that it does not violate any of the constraints
enumerated in the `LimitRange` object in a `Namespace`. If you are using `LimitRange` objects in
your Kubernetes deployment, you MUST use this plug-in to enforce those constraints. LimitRanger can also
be used to apply default resource requests to Pods that don't specify any; currently, the default LimitRanger
applies a 0.1 CPU requirement to all Pods in the `default` namespace.

See the [limitRange design doc](../design/admission_control_limit_range.html) and the [example of Limit Range](limitrange/) for more details.

### InitialResources (experimental)

This plug-in observes pod creation requests. If a container omits compute resource requests and limits,
then the plug-in auto-populates a compute resource request based on historical usage of containers running the same image.
If there is not enough data to make a decision, the request is left unchanged.
When the plug-in sets a compute resource request, it annotates the pod with information on what compute resources it auto-populated.

See the [InitialResources proposal](../proposals/initial-resources.html) for more details.

### NamespaceExists (deprecated)

This plug-in will observe all incoming requests that attempt to create a resource in a Kubernetes `Namespace`
and reject the request if the `Namespace` was not previously created. We strongly recommend running
this plug-in to ensure integrity of your data.

The functionality of this admission controller has been merged into `NamespaceLifecycle`.

### NamespaceAutoProvision (deprecated)

This plug-in will observe all incoming requests that attempt to create a resource in a Kubernetes `Namespace`
and create a new `Namespace` if one did not already exist previously.

We strongly recommend `NamespaceLifecycle` over `NamespaceAutoProvision`.

### NamespaceLifecycle

This plug-in enforces that a `Namespace` that is undergoing termination cannot have new objects created in it,
and ensures that requests in a non-existent `Namespace` are rejected.

A `Namespace` deletion kicks off a sequence of operations that remove all objects (pods, services, etc.) in that
namespace. In order to enforce integrity of that process, we strongly recommend running this plug-in.

## Is there a recommended set of plug-ins to use?

Yes.

For Kubernetes 1.0, we strongly recommend running the following set of admission control plug-ins (order matters):

```
{% raw %}
--admission-control=NamespaceLifecycle,LimitRanger,SecurityContextDeny,ServiceAccount,ResourceQuota
{% endraw %}
```


<!-- BEGIN MUNGE: IS_VERSIONED -->
<!-- TAG IS_VERSIONED -->
<!-- END MUNGE: IS_VERSIONED -->


<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/admin/admission-controllers.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->
@@ -0,0 +1,146 @@
---
layout: docwithnav
title: "Authentication Plugins"
---
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->


<!-- END MUNGE: UNVERSIONED_WARNING -->

# Authentication Plugins

Kubernetes uses client certificates, tokens, or http basic auth to authenticate users for API calls.

**Client certificate authentication** is enabled by passing the `--client-ca-file=SOMEFILE`
option to apiserver. The referenced file must contain one or more certificate authorities
to use to validate client certificates presented to the apiserver. If a client certificate
is presented and verified, the common name of the subject is used as the user name for the
request.

**Token File** is enabled by passing the `--token-auth-file=SOMEFILE` option
to apiserver. Currently, tokens last indefinitely, and the token list cannot
be changed without restarting apiserver.

The token file format is implemented in `plugin/pkg/auth/authenticator/token/tokenfile/...`
and is a csv file with 3 columns: token, user name, user uid.

When using token authentication from an http client, the apiserver expects an `Authorization`
header with a value of `Bearer SOMETOKEN`.
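
As a sketch, a token file entry and a matching request could look like the following. The token, user name, uid, and server address are made-up values for illustration only.

{% highlight console %}
{% raw %}
# one line of the file passed via --token-auth-file (token,user name,user uid)
$ cat known_tokens.csv
31ada4fd-adec-460c-809a-9e56ceb75269,alice,1001

# present the token on the secure port as a Bearer header
$ curl --cacert ca.crt -H "Authorization: Bearer 31ada4fd-adec-460c-809a-9e56ceb75269" \
    https://<apiserver-ip>:6443/api/v1/namespaces
{% endraw %}
{% endhighlight %}
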
**OpenID Connect ID Token** is enabled by passing the following options to the apiserver:
- `--oidc-issuer-url` (required) tells the apiserver where to connect to the OpenID provider. Only the HTTPS scheme will be accepted.
- `--oidc-client-id` (required) is used by apiserver to verify the audience of the token.
  A valid [ID token](http://openid.net/specs/openid-connect-core-1_0.html#IDToken) MUST have this
  client-id in its `aud` claims.
- `--oidc-ca-file` (optional) is used by apiserver to establish and verify the secure connection
  to the OpenID provider.
- `--oidc-username-claim` (optional, experimental) specifies which OpenID claim to use as the user name. By default, `sub`
  will be used, which should be unique and immutable under the issuer's domain. A cluster administrator can
  choose other claims such as `email` to use as the user name, but the uniqueness and immutability is not guaranteed.

Please note that this flag is still experimental until we settle more on how to handle the mapping of the OpenID user to the Kubernetes user. Thus further changes are possible.

Currently, the ID token will be obtained by some third-party app. This means the app and apiserver
MUST share the `--oidc-client-id`.

Like **Token File**, when using token authentication from an http client the apiserver expects
an `Authorization` header with a value of `Bearer SOMETOKEN`.

**Basic authentication** is enabled by passing the `--basic-auth-file=SOMEFILE`
option to apiserver. Currently, the basic auth credentials last indefinitely,
and the password cannot be changed without restarting apiserver. Note that basic
authentication is currently supported for convenience while we finish making the
more secure modes described above easier to use.

The basic auth file format is implemented in `plugin/pkg/auth/authenticator/password/passwordfile/...`
and is a csv file with 3 columns: password, user name, user id.

When using basic authentication from an http client, the apiserver expects an `Authorization` header
with a value of `Basic BASE64ENCODED(USER:PASSWORD)`.
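
A sketch of the corresponding file and request, again with made-up values:

{% highlight console %}
{% raw %}
# one line of the file passed via --basic-auth-file (password,user name,user id)
$ cat basic_auth.csv
myPassword,bob,1002

# curl builds the base64-encoded Basic header from user:password
$ curl --cacert ca.crt -u bob:myPassword https://<apiserver-ip>:6443/api/v1/namespaces
{% endraw %}
{% endhighlight %}
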
**Keystone authentication** is enabled by passing the `--experimental-keystone-url=<AuthURL>`
option to the apiserver during startup. The plugin is implemented in
`plugin/pkg/auth/authenticator/request/keystone/keystone.go`.
For details on how to use keystone to manage projects and users, refer to the
[Keystone documentation](http://docs.openstack.org/developer/keystone/). Please note that
this plugin is still experimental, which means it is subject to changes.
Please refer to the [discussion](https://github.com/kubernetes/kubernetes/pull/11798#issuecomment-129655212)
and the [blueprint](https://github.com/kubernetes/kubernetes/issues/11626) for more details.

## Plugin Development

We plan for the Kubernetes API server to issue tokens
after the user has been (re)authenticated by a *bedrock* authentication
provider external to Kubernetes. We plan to make it easy to develop modules
that interface between Kubernetes and a bedrock authentication provider (e.g.
github.com, google.com, enterprise directory, kerberos, etc.)

## APPENDIX

### Creating Certificates

When using client certificate authentication, you can generate certificates manually or
using an existing deployment script.

**Deployment script** is implemented at
`cluster/saltbase/salt/generate-cert/make-ca-cert.sh`.
Execute this script with two parameters. The first is the IP address of the apiserver; the second is
a list of subject alternate names in the form `IP:<ip-address> or DNS:<dns-name>`.
The script will generate three files: ca.crt, server.crt, and server.key.
Finally, add these parameters
`--client-ca-file=/srv/kubernetes/ca.crt`
`--tls-cert-file=/srv/kubernetes/server.cert`
`--tls-private-key-file=/srv/kubernetes/server.key`
into the apiserver start parameters.

**easyrsa** can be used to manually generate certificates for your cluster.

1. Download, unpack, and initialize the patched version of easyrsa3.

        curl -L -O https://storage.googleapis.com/kubernetes-release/easy-rsa/easy-rsa.tar.gz
        tar xzf easy-rsa.tar.gz
        cd easy-rsa-master/easyrsa3
        ./easyrsa init-pki
1. Generate a CA. (`--batch` sets automatic mode; `--req-cn` sets the default CN to use.)

        ./easyrsa --batch "--req-cn=${MASTER_IP}@`date +%s`" build-ca nopass
1. Generate server certificate and key.
   (build-server-full [filename]: Generate a keypair and sign locally for a client or server)

        ./easyrsa --subject-alt-name="IP:${MASTER_IP}" build-server-full kubernetes-master nopass
1. Copy `pki/ca.crt`, `pki/issued/kubernetes-master.crt`, and
   `pki/private/kubernetes-master.key` to your directory.
1. Remember to fill in the parameters
   `--client-ca-file=/yourdirectory/ca.crt`
   `--tls-cert-file=/yourdirectory/server.cert`
   `--tls-private-key-file=/yourdirectory/server.key`
   and add these into the apiserver start parameters.

**openssl** can also be used to manually generate certificates for your cluster.

1. Generate a 2048-bit ca.key:
   `openssl genrsa -out ca.key 2048`
1. Using ca.key, generate ca.crt (use `-days` to set the certificate validity period):
   `openssl req -x509 -new -nodes -key ca.key -subj "/CN=${MASTER_IP}" -days 10000 -out ca.crt`
1. Generate a 2048-bit server.key:
   `openssl genrsa -out server.key 2048`
1. Using server.key, generate server.csr:
   `openssl req -new -key server.key -subj "/CN=${MASTER_IP}" -out server.csr`
1. Using ca.key, ca.crt, and server.csr, generate server.crt:
   `openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key -CAcreateserial -out server.crt -days 10000`
1. View the certificate:
   `openssl x509 -noout -text -in ./server.crt`

Finally, do not forget to fill in the same parameters and add them to the apiserver start parameters.


<!-- BEGIN MUNGE: IS_VERSIONED -->
<!-- TAG IS_VERSIONED -->
<!-- END MUNGE: IS_VERSIONED -->


<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/admin/authentication.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->
@@ -0,0 +1,159 @@
---
layout: docwithnav
title: "Authorization Plugins"
---
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->


<!-- END MUNGE: UNVERSIONED_WARNING -->

# Authorization Plugins


In Kubernetes, authorization happens as a separate step from authentication.
See the [authentication documentation](authentication.html) for an
overview of authentication.

Authorization applies to all HTTP accesses on the main (secure) apiserver port.

The authorization check for any request compares attributes of the context of
the request (such as user, resource, and namespace) with access
policies. An API call must be allowed by some policy in order to proceed.

The following implementations are available, and are selected by flag:
- `--authorization-mode=AlwaysDeny`
- `--authorization-mode=AlwaysAllow`
- `--authorization-mode=ABAC`

`AlwaysDeny` blocks all requests (used in tests).
`AlwaysAllow` allows all requests; use if you don't need authorization.
`ABAC` allows for user-configured authorization policy. ABAC stands for Attribute-Based Access Control.

## ABAC Mode

### Request Attributes

A request has 5 attributes that can be considered for authorization:
- user (the user-string which a user was authenticated as).
- group (the list of group names the authenticated user is a member of).
- whether the request is readonly (GETs are readonly).
- what resource is being accessed.
  - applies only to the API endpoints, such as
    `/api/v1/namespaces/default/pods`. For miscellaneous endpoints, like `/version`, the
    resource is the empty string.
- the namespace of the object being accessed, or the empty string if the
  endpoint does not support namespaced objects.

We anticipate adding more attributes to allow finer grained access control and
to assist in policy management.

### Policy File Format

For mode `ABAC`, also specify `--authorization-policy-file=SOME_FILENAME`.

The file format is [one JSON object per line](http://jsonlines.org/). There should be no enclosing list or map, just
one map per line.

Each line is a "policy object". A policy object is a map with the following properties:
- `user`, type string; the user-string from `--token-auth-file`. If you specify `user`, it must match the username of the authenticated user.
- `group`, type string; if you specify `group`, it must match one of the groups of the authenticated user.
- `readonly`, type boolean, when true, means that the policy only applies to GET
  operations.
- `resource`, type string; a resource from an URL, such as `pods`.
- `namespace`, type string; a namespace string.

An unset property is the same as a property set to the zero value for its type (e.g. empty string, 0, false).
However, unset should be preferred for readability.

In the future, policies may be expressed in a JSON format, and managed via a REST
interface.

### Authorization Algorithm

A request has attributes which correspond to the properties of a policy object.

When a request is received, the attributes are determined. Unknown attributes
are set to the zero value of its type (e.g. empty string, 0, false).

An unset property will match any value of the corresponding
attribute. An unset attribute will match any value of the corresponding property.

The tuple of attributes is checked for a match against every policy in the policy file.
If at least one line matches the request attributes, then the request is authorized (but may fail later validation).

To permit any user to do something, write a policy with the user property unset.
Similarly, a policy with an unset namespace applies regardless of namespace.

### Examples

1. Alice can do anything: `{"user":"alice"}`
2. Kubelet can read any pods: `{"user":"kubelet", "resource": "pods", "readonly": true}`
3. Kubelet can read and write events: `{"user":"kubelet", "resource": "events"}`
4. Bob can just read pods in namespace "projectCaribou": `{"user":"bob", "resource": "pods", "readonly": true, "namespace": "projectCaribou"}`

[Complete file example](http://releases.k8s.io/release-1.1/pkg/auth/authorizer/abac/example_policy_file.jsonl)

### A quick note on service accounts

A service account automatically generates a user. The user's name is generated according to the naming convention:

```
{% raw %}
system:serviceaccount:<namespace>:<serviceaccountname>
{% endraw %}
```

Creating a new namespace also causes a new service account to be created, of this form:

```
{% raw %}
system:serviceaccount:<namespace>:default
{% endraw %}
```

For example, if you wanted to grant the default service account in the kube-system namespace full privilege to the API, you would add this line to your policy file:

{% highlight json %}
{% raw %}
{"user":"system:serviceaccount:kube-system:default"}
{% endraw %}
{% endhighlight %}

The apiserver will need to be restarted to pick up the new policy lines.
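
Putting the flags together, a sketch of an apiserver invocation using ABAC might look like this; the policy file path is an assumption for illustration:

{% highlight console %}
{% raw %}
kube-apiserver --authorization-mode=ABAC \
  --authorization-policy-file=/srv/kubernetes/abac-policy.jsonl
{% endraw %}
{% endhighlight %}
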
## Plugin Development

Other implementations can be developed fairly easily.
The APIserver calls the Authorizer interface:

{% highlight go %}
{% raw %}
type Authorizer interface {
  Authorize(a Attributes) error
}
{% endraw %}
{% endhighlight %}

to determine whether or not to allow each API action.

An authorization plugin is a module that implements this interface.
Authorization plugin code goes in `pkg/auth/authorizer/$MODULENAME`.

An authorization module can be completely implemented in Go, or can call out
to a remote authorization service. Authorization modules can implement
their own caching to reduce the cost of repeated authorization calls with the
same or similar arguments. Developers should then consider the interaction between
caching and revocation of permissions.


<!-- BEGIN MUNGE: IS_VERSIONED -->
<!-- TAG IS_VERSIONED -->
<!-- END MUNGE: IS_VERSIONED -->


<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/admin/authorization.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->
@@ -0,0 +1,136 @@
---
layout: docwithnav
title: "Kubernetes Cluster Admin Guide: Cluster Components"
---
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->


<!-- END MUNGE: UNVERSIONED_WARNING -->

# Kubernetes Cluster Admin Guide: Cluster Components

This document outlines the various binary components that need to run to
deliver a functioning Kubernetes cluster.

## Master Components

Master components are those that provide the cluster's control plane. For
example, master components are responsible for making global decisions about the
cluster (e.g., scheduling), and detecting and responding to cluster events
(e.g., starting up a new pod when a replication controller's 'replicas' field is
unsatisfied).

Master components could in theory be run on any node in the cluster. However,
for simplicity, current setup scripts typically start all master components on
the same VM, and do not run user containers on this VM. See
[high-availability.md](high-availability.html) for an example multi-master-VM setup.

Even in the future, when Kubernetes is fully self-hosting, it will probably be
wise to only allow master components to schedule on a subset of nodes, to limit
co-running with user-run pods, reducing the possible scope of a
node-compromising security exploit.

### kube-apiserver

[kube-apiserver](kube-apiserver.html) exposes the Kubernetes API; it is the front-end for the
Kubernetes control plane. It is designed to scale horizontally (i.e., one scales
it by running more of them -- see [high-availability.md](high-availability.html)).

### etcd

[etcd](etcd.html) is used as Kubernetes' backing store. All cluster data is stored here.
Proper administration of a Kubernetes cluster includes a backup plan for etcd's
data.

### kube-controller-manager

[kube-controller-manager](kube-controller-manager.html) is a binary that runs controllers, which are the
background threads that handle routine tasks in the cluster. Logically, each
controller is a separate process, but to reduce the number of moving pieces in
the system, they are all compiled into a single binary and run in a single
process.

These controllers include:

* Node Controller
  * Responsible for noticing & responding when nodes go down.
* Replication Controller
  * Responsible for maintaining the correct number of pods for every replication
    controller object in the system.
* Endpoints Controller
  * Populates the Endpoints object (i.e., joins Services & Pods).
* Service Account & Token Controllers
  * Create default accounts and API access tokens for new namespaces.
* ... and others.

### kube-scheduler

[kube-scheduler](kube-scheduler.html) watches newly created pods that have no node assigned, and
selects a node for them to run on.

### addons

Addons are pods and services that implement cluster features. They don't run on
the master VM, but currently the default setup scripts that make the API calls
to create these pods and services do run on the master VM. See:
[kube-master-addons](http://releases.k8s.io/release-1.1/cluster/saltbase/salt/kube-master-addons/kube-master-addons.sh)

Addon objects are created in the "kube-system" namespace.

Example addons are:
* [DNS](http://releases.k8s.io/release-1.1/cluster/addons/dns/) provides cluster local DNS.
* [kube-ui](http://releases.k8s.io/release-1.1/cluster/addons/kube-ui/) provides a graphical UI for the
  cluster.
* [fluentd-elasticsearch](http://releases.k8s.io/release-1.1/cluster/addons/fluentd-elasticsearch/) provides
  log storage. Also see the [gcp version](http://releases.k8s.io/release-1.1/cluster/addons/fluentd-gcp/).
* [cluster-monitoring](http://releases.k8s.io/release-1.1/cluster/addons/cluster-monitoring/) provides
  monitoring for the cluster.
## Node components

Node components run on every node, maintaining running pods and providing them
the Kubernetes runtime environment.

### kubelet

[kubelet](kubelet.html) is the primary node agent. It:
* Watches for pods that have been assigned to its node (either by apiserver
  or via local configuration file) and:
  * Mounts the pod's required volumes
  * Downloads the pod's secrets
  * Runs the pod's containers via docker (or, experimentally, rkt).
  * Periodically executes any requested container liveness probes.
  * Reports the status of the pod back to the rest of the system, by creating a
    "mirror pod" if necessary.
* Reports the status of the node back to the rest of the system.

### kube-proxy

[kube-proxy](kube-proxy.html) enables the Kubernetes service abstraction by maintaining
network rules on the host and performing connection forwarding.

### docker

`docker` is of course used for actually running containers.

### rkt

`rkt` is supported experimentally as an alternative to docker.

### supervisord

`supervisord` is a lightweight process babysitting system for keeping kubelet and docker
running.


<!-- BEGIN MUNGE: IS_VERSIONED -->
<!-- TAG IS_VERSIONED -->
<!-- END MUNGE: IS_VERSIONED -->


<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/admin/cluster-components.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->
@@ -0,0 +1,86 @@
---
layout: docwithnav
title: "Kubernetes Large Cluster"
---
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->


<!-- END MUNGE: UNVERSIONED_WARNING -->

# Kubernetes Large Cluster

## Support

At v1.0, Kubernetes supports clusters up to 100 nodes with 30 pods per node and 1-2 containers per pod.

## Setup

A cluster is a set of nodes (physical or virtual machines) running Kubernetes agents, managed by a "master" (the cluster-level control plane).

Normally the number of nodes in a cluster is controlled by the value `NUM_MINIONS` in the platform-specific `config-default.sh` file (for example, see [GCE's `config-default.sh`](http://releases.k8s.io/release-1.1/cluster/gce/config-default.sh)).
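
As a sketch, on providers that read `config-default.sh` the value can typically be overridden from the environment at cluster creation time rather than by editing the file; the number shown here is only an example:

{% highlight console %}
{% raw %}
export NUM_MINIONS=200
cluster/kube-up.sh
{% endraw %}
{% endhighlight %}
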
Simply changing that value to something very large, however, may cause the setup script to fail for many cloud providers. A GCE deployment, for example, will run into quota issues and fail to bring the cluster up.

When setting up a large Kubernetes cluster, the following issues must be considered.

### Quota Issues

To avoid running into cloud provider quota issues, when creating a cluster with many nodes, consider:
* Increase the quota for things like CPU, IPs, etc.
  * In [GCE, for example,](https://cloud.google.com/compute/docs/resource-quotas) you'll want to increase the quota for:
    * CPUs
    * VM instances
    * Total persistent disk reserved
    * In-use IP addresses
    * Firewall Rules
    * Forwarding rules
    * Routes
    * Target pools
* Gating the setup script so that it brings up new node VMs in smaller batches with waits in between, because some cloud providers rate limit the creation of VMs.

### Addon Resources

To prevent memory leaks or other resource issues in [cluster addons](https://releases.k8s.io/release-1.1/cluster/addons) from consuming all the resources available on a node, Kubernetes sets resource limits on addon containers to limit the CPU and Memory resources they can consume (See PR [#10653](http://pr.k8s.io/10653/files) and [#10778](http://pr.k8s.io/10778/files)).

For example:

{% highlight yaml %}
{% raw %}
containers:
  - image: gcr.io/google_containers/heapster:v0.15.0
    name: heapster
    resources:
      limits:
        cpu: 100m
        memory: 200Mi
{% endraw %}
{% endhighlight %}

These limits, however, are based on data collected from addons running on 4-node clusters (see [#10335](http://issue.k8s.io/10335#issuecomment-117861225)). The addons consume a lot more resources when running on large deployment clusters (see [#5880](http://issue.k8s.io/5880#issuecomment-113984085)). So, if a large cluster is deployed without adjusting these values, the addons may continuously get killed because they keep hitting the limits.

To avoid running into cluster addon resource issues, when creating a cluster with many nodes, consider the following:
* Scale memory and CPU limits for each of the following addons, if used, along with the size of cluster (there is one replica of each handling the entire cluster so memory and CPU usage tends to grow proportionally with size/load on cluster):
  * Heapster ([GCM/GCL backed](http://releases.k8s.io/release-1.1/cluster/addons/cluster-monitoring/google/heapster-controller.yaml), [InfluxDB backed](http://releases.k8s.io/release-1.1/cluster/addons/cluster-monitoring/influxdb/heapster-controller.yaml), [InfluxDB/GCL backed](http://releases.k8s.io/release-1.1/cluster/addons/cluster-monitoring/googleinfluxdb/heapster-controller-combined.yaml), [standalone](http://releases.k8s.io/release-1.1/cluster/addons/cluster-monitoring/standalone/heapster-controller.yaml))
  * [InfluxDB and Grafana](http://releases.k8s.io/release-1.1/cluster/addons/cluster-monitoring/influxdb/influxdb-grafana-controller.yaml)
  * [skydns, kube2sky, and dns etcd](http://releases.k8s.io/release-1.1/cluster/addons/dns/skydns-rc.yaml.in)
  * [Kibana](http://releases.k8s.io/release-1.1/cluster/addons/fluentd-elasticsearch/kibana-controller.yaml)
* Scale number of replicas for the following addons, if used, along with the size of cluster (there are multiple replicas of each so increasing replicas should help handle increased load, but, since load per replica also increases slightly, also consider increasing CPU/memory limits):
  * [elasticsearch](http://releases.k8s.io/release-1.1/cluster/addons/fluentd-elasticsearch/es-controller.yaml)
* Increase memory and CPU limits slightly for each of the following addons, if used, along with the size of cluster (there is one replica per node but CPU/memory usage increases slightly along with cluster load/size as well):
  * [FluentD with ElasticSearch Plugin](http://releases.k8s.io/release-1.1/cluster/saltbase/salt/fluentd-es/fluentd-es.yaml)
  * [FluentD with GCP Plugin](http://releases.k8s.io/release-1.1/cluster/saltbase/salt/fluentd-gcp/fluentd-gcp.yaml)

For directions on how to detect if addon containers are hitting resource limits, see the [Troubleshooting section of Compute Resources](../user-guide/compute-resources.html#troubleshooting).


<!-- BEGIN MUNGE: IS_VERSIONED -->
<!-- TAG IS_VERSIONED -->
<!-- END MUNGE: IS_VERSIONED -->


<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/admin/cluster-large.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->
@@ -0,0 +1,221 @@
---
layout: docwithnav
title: "Cluster Management"
---
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->


<!-- END MUNGE: UNVERSIONED_WARNING -->

# Cluster Management

This document describes several topics related to the lifecycle of a cluster: creating a new cluster,
upgrading your cluster's
master and worker nodes, performing node maintenance (e.g. kernel upgrades), and upgrading the Kubernetes API version of a
running cluster.

## Creating and configuring a Cluster

To install Kubernetes on a set of machines, consult one of the existing [Getting Started guides](../../docs/getting-started-guides/README.html) depending on your environment.

## Upgrading a cluster

The current state of cluster upgrades is provider dependent.

### Master Upgrades

Both Google Container Engine (GKE) and
Compute Engine Open Source (GCE-OSS) support node upgrades via a [Managed Instance Group](https://cloud.google.com/compute/docs/instance-groups/).
Managed Instance Group upgrades sequentially delete and recreate each virtual machine, while maintaining the same
Persistent Disk (PD) to ensure that data is retained across the upgrade.

In contrast, the `kube-push.sh` process used on [other platforms](#other-platforms) attempts to upgrade the binaries in
place, without recreating the virtual machines.

### Node Upgrades

Node upgrades for GKE and GCE-OSS again use a Managed Instance Group; each node is sequentially destroyed and then recreated with new software. Any Pods that are running
on that node need to be controlled by a Replication Controller, or manually re-created after the roll out.

For other platforms, `kube-push.sh` is again used, performing an in-place binary upgrade on existing machines.

### Upgrading Google Container Engine (GKE)

Google Container Engine automatically updates master components (e.g. `kube-apiserver`, `kube-scheduler`) to the latest
version. It also handles upgrading the operating system and other components that the master runs on.

The node upgrade process is user-initiated and is described in the [GKE documentation.](https://cloud.google.com/container-engine/docs/clusters/upgrade)

### Upgrading open source Google Compute Engine clusters

Upgrades on open source Google Compute Engine (GCE) clusters are controlled by the `cluster/gce/upgrade.sh` script.

Get its usage by running `cluster/gce/upgrade.sh -h`.

For example, to upgrade just your master to a specific version (v1.0.2):

{% highlight console %}
{% raw %}
cluster/gce/upgrade.sh -M v1.0.2
{% endraw %}
{% endhighlight %}

Alternatively, to upgrade your entire cluster to the latest stable release:

{% highlight console %}
{% raw %}
cluster/gce/upgrade.sh release/stable
{% endraw %}
{% endhighlight %}

### Other platforms

The `cluster/kube-push.sh` script will do a rudimentary update. This process is still quite experimental; we
recommend testing the upgrade on an experimental cluster before performing the update on a production cluster.

## Resizing a cluster

If your cluster runs short on resources you can easily add more machines to it if your cluster is running in [Node self-registration mode](node.html#self-registration-of-nodes).
If you're using GCE or GKE, it's done by resizing the Instance Group managing your Nodes. It can be accomplished by modifying the number of instances on the `Compute > Compute Engine > Instance groups > your group > Edit group` [Google Cloud Console page](https://console.developers.google.com) or using the gcloud CLI:

```
{% raw %}
gcloud compute instance-groups managed --zone compute-zone resize my-cluster-minion-group --new-size 42
{% endraw %}
```

The Instance Group will take care of putting the appropriate image on new machines and starting them, while the Kubelet will register its Node with the API server to make it available for scheduling. If you scale the instance group down, the system will randomly choose Nodes to kill.

In other environments you may need to configure the machine yourself and tell the Kubelet on which machine the API server is running.


### Horizontal auto-scaling of nodes (GCE)

If you are using GCE, you can configure your cluster so that the number of nodes will be automatically scaled based on their CPU and memory utilization.
Before setting up the cluster with `kube-up.sh`, you can set the `KUBE_ENABLE_NODE_AUTOSCALER` environment variable to `true` and export it.
The script will create an autoscaler for the instance group managing your nodes.

The autoscaler will try to maintain the average CPU and memory utilization of nodes within the cluster close to the target value.
The target value can be configured by the `KUBE_TARGET_NODE_UTILIZATION` environment variable (default: 0.7) for `kube-up.sh` when creating the cluster.
The node utilization is the node's total CPU/memory usage (OS + k8s + user load) divided by the node's capacity.
If the desired numbers of nodes in the cluster resulting from CPU utilization and memory utilization are different,
the autoscaler will choose the bigger number.
The number of nodes in the cluster set by the autoscaler will be limited from `KUBE_AUTOSCALER_MIN_NODES` (default: 1)
to `KUBE_AUTOSCALER_MAX_NODES` (default: the initial number of nodes in the cluster).
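
A sketch of enabling this before creating the cluster, using the variables named above (the values shown are examples only):

{% highlight console %}
{% raw %}
export KUBE_ENABLE_NODE_AUTOSCALER=true
export KUBE_TARGET_NODE_UTILIZATION=0.7
export KUBE_AUTOSCALER_MIN_NODES=1
export KUBE_AUTOSCALER_MAX_NODES=10
cluster/kube-up.sh
{% endraw %}
{% endhighlight %}
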
|
||||
|
||||
The autoscaler is implemented as a Compute Engine Autoscaler.
|
||||
The initial values of the autoscaler parameters set by ``kube-up.sh`` and some more advanced options can be tweaked on
|
||||
`Compute > Compute Engine > Instance groups > your group > Edit group`[Google Cloud Console page](https://console.developers.google.com)
|
||||
or using gcloud CLI:
|
||||
|
||||
```
|
||||
{% raw %}
|
||||
gcloud preview autoscaler --zone compute-zone <command>
|
||||
{% endraw %}
|
||||
```
|
||||
|
||||
Note that autoscaling will work properly only if node metrics are accessible in Google Cloud Monitoring.
|
||||
To make the metrics accessible, you need to create your cluster with ```KUBE_ENABLE_CLUSTER_MONITORING```
|
||||
equal to ```google``` or ```googleinfluxdb``` (```googleinfluxdb``` is the default value).
|
||||
|
||||
## Maintenance on a Node
|
||||
|
||||
If you need to reboot a node (such as for a kernel upgrade, libc upgrade, hardware repair, etc.), and the downtime is
|
||||
brief, then when the Kubelet restarts, it will attempt to restart the pods scheduled to it. If the reboot takes longer,
|
||||
then the node controller will terminate the pods that are bound to the unavailable node. If there is a corresponding
|
||||
replication controller, then a new copy of the pod will be started on a different node. So, in the case where all
|
||||
pods are replicated, upgrades can be done without special coordination, assuming that not all nodes will go down at the same time.
|
||||
|
||||
If you want more control over the upgrading process, you may use the following workflow:
|
||||
|
||||
Mark the node to be rebooted as unschedulable:
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
kubectl replace nodes $NODENAME --patch='{"apiVersion": "v1", "spec": {"unschedulable": true}}'
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
This keeps new pods from landing on the node while you are trying to get them off.
|
||||
|
||||
Get the pods off the machine, via any of the following strategies:
|
||||
* Wait for finite-duration pods to complete.
|
||||
* Delete pods with:
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
kubectl delete pods $PODNAME
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
For pods with a replication controller, the pod will eventually be replaced by a new pod which will be scheduled to a new node. Additionally, if the pod is part of a service, then clients will automatically be redirected to the new pod.
|
||||
|
||||
For pods with no replication controller, you need to bring up a new copy of the pod, and assuming it is not part of a service, redirect clients to it.
|
||||
|
||||
Perform maintenance work on the node.
|
||||
|
||||
Make the node schedulable again:
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
kubectl replace nodes $NODENAME --patch='{"apiVersion": "v1", "spec": {"unschedulable": false}}'
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
If you deleted the node's VM instance and created a new one, then a new schedulable node resource will
|
||||
be created automatically when you create a new VM instance (if you're using a cloud provider that supports
|
||||
node discovery; currently this is only Google Compute Engine, not including CoreOS on Google Compute Engine using kube-register). See [Node](node.html) for more details.
|
||||
|
||||
## Advanced Topics
|
||||
|
||||
### Upgrading to a different API version
|
||||
|
||||
When a new API version is released, you may need to upgrade a cluster to support the new API version (e.g. switching from 'v1' to 'v2' when 'v2' is launched)
|
||||
|
||||
This is an infrequent event, but it requires careful management. There is a sequence of steps to upgrade to a new API version.
|
||||
|
||||
1. Turn on the new API version.
|
||||
1. Upgrade the cluster's storage to use the new version.
|
||||
1. Upgrade all config files. Identify users of the old API version endpoints.
|
||||
1. Update existing objects in the storage to new version by running `cluster/update-storage-objects.sh`.
|
||||
1. Turn off the old API version.
|
||||
|
||||
### Turn on or off an API version for your cluster
|
||||
|
||||
Specific API versions can be turned on or off by passing the `--runtime-config=api/<version>` flag while bringing up the API server. For example, to turn off the v1 API, pass `--runtime-config=api/v1=false`.
runtime-config also supports two special keys, `api/all` and `api/legacy`, to control all and legacy APIs respectively.
For example, to turn off all API versions except v1, pass `--runtime-config=api/all=false,api/v1=true`.
|
||||
For the purposes of these flags, _legacy_ APIs are those APIs which have been explicitly deprecated (e.g. `v1beta3`).
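As a minimal sketch (other required apiserver flags are omitted, and how the apiserver is launched depends on your installation), keeping only the v1 API enabled looks like:

{% highlight sh %}
{% raw %}
# Keep only the v1 API enabled; the other flags your installation
# requires go on the same command line.
kube-apiserver --runtime-config=api/all=false,api/v1=true
{% endraw %}
{% endhighlight %}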
|
||||
|
||||
### Switching your cluster's storage API version
|
||||
|
||||
The objects that are stored to disk for a cluster's internal representation of the Kubernetes resources active in the cluster are written using a particular version of the API.
|
||||
When the supported API changes, these objects may need to be rewritten in the newer API. Failure to do this will eventually result in resources that are no longer decodable or usable
|
||||
by the kubernetes API server.
|
||||
|
||||
There is a `KUBE_API_VERSIONS` environment variable for the `kube-apiserver` binary which controls the API versions that are supported in the cluster. The first version in the list is used as the cluster's storage version. Hence, to set a specific version as the storage version, bring it to the front of the list of versions in the value of `KUBE_API_VERSIONS`. You need to restart the `kube-apiserver` binary for changes to this variable to take effect.
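For example, a minimal sketch, assuming the apiserver reads its environment from the shell that launches it (the version list below is illustrative only):

{% highlight sh %}
{% raw %}
# The first entry in the list ("v1" here) becomes the storage version.
export KUBE_API_VERSIONS="v1,extensions/v1beta1"
# Restart kube-apiserver afterwards; the restart mechanism depends on how
# the binary is managed in your cluster (init script, systemd, static pod).
{% endraw %}
{% endhighlight %}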
|
||||
|
||||
### Switching your config files to a new API version
|
||||
|
||||
You can use the `kube-version-change` utility to convert config files between different API versions.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ hack/build-go.sh cmd/kube-version-change
|
||||
$ _output/local/go/bin/kube-version-change -i myPod.v1beta3.yaml -o myPod.v1.yaml
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/admin/cluster-management.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,132 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "Cluster Troubleshooting"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
# Cluster Troubleshooting
|
||||
|
||||
This doc is about cluster troubleshooting; we assume you have already ruled out your application as the root cause of the
|
||||
problem you are experiencing. See
|
||||
the [application troubleshooting guide](../user-guide/application-troubleshooting.html) for tips on application debugging.
|
||||
You may also visit the [troubleshooting document](../troubleshooting.html) for more information.
|
||||
|
||||
## Listing your cluster
|
||||
|
||||
The first thing to debug in your cluster is whether your nodes are all registered correctly.
|
||||
|
||||
Run
|
||||
|
||||
{% highlight sh %}
|
||||
{% raw %}
|
||||
kubectl get nodes
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
And verify that all of the nodes you expect to see are present and that they are all in the `Ready` state.
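Healthy output looks roughly like the following; the node names are placeholders and the exact columns vary by release:

{% highlight console %}
{% raw %}
$ kubectl get nodes
NAME      LABELS                          STATUS
node-1    kubernetes.io/hostname=node-1   Ready
node-2    kubernetes.io/hostname=node-2   Ready
{% endraw %}
{% endhighlight %}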
|
||||
|
||||
## Looking at logs
|
||||
|
||||
For now, digging deeper into the cluster requires logging into the relevant machines. Here are the locations
|
||||
of the relevant log files. (Note that on systemd-based systems, you may need to use `journalctl` instead; see the sketch below.)
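On such systems, a minimal sketch (the unit names are assumptions and depend on how the components were installed):

{% highlight sh %}
{% raw %}
# Assumes the components run as systemd units named "kubelet" and
# "kube-apiserver"; adjust the unit names to match your installation.
journalctl -u kubelet
journalctl -u kube-apiserver
{% endraw %}
{% endhighlight %}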
|
||||
|
||||
### Master
|
||||
|
||||
* /var/log/kube-apiserver.log - API Server, responsible for serving the API
|
||||
* /var/log/kube-scheduler.log - Scheduler, responsible for making scheduling decisions
|
||||
* /var/log/kube-controller-manager.log - Controller that manages replication controllers
|
||||
|
||||
### Worker Nodes
|
||||
|
||||
* /var/log/kubelet.log - Kubelet, responsible for running containers on the node
|
||||
* /var/log/kube-proxy.log - Kube Proxy, responsible for service load balancing
|
||||
|
||||
## A general overview of cluster failure modes
|
||||
|
||||
This is an incomplete list of things that could go wrong, and how to adjust your cluster setup to mitigate the problems.
|
||||
|
||||
Root causes:
|
||||
- VM(s) shutdown
|
||||
- Network partition within cluster, or between cluster and users
|
||||
- Crashes in Kubernetes software
|
||||
- Data loss or unavailability of persistent storage (e.g. GCE PD or AWS EBS volume)
|
||||
- Operator error, e.g. misconfigured Kubernetes software or application software
|
||||
|
||||
Specific scenarios:
|
||||
- Apiserver VM shutdown or apiserver crashing
|
||||
- Results
|
||||
- unable to stop, update, or start new pods, services, replication controller
|
||||
- existing pods and services should continue to work normally, unless they depend on the Kubernetes API
|
||||
- Apiserver backing storage lost
|
||||
- Results
|
||||
- apiserver should fail to come up
|
||||
- kubelets will not be able to reach it but will continue to run the same pods and provide the same service proxying
|
||||
- manual recovery or recreation of apiserver state necessary before apiserver is restarted
|
||||
- Supporting services (node controller, replication controller manager, scheduler, etc) VM shutdown or crashes
|
||||
- currently those are colocated with the apiserver, and their unavailability has similar consequences as apiserver
|
||||
- in future, these will be replicated as well and may not be co-located
|
||||
- they do not have their own persistent state
|
||||
- Individual node (VM or physical machine) shuts down
|
||||
- Results
|
||||
- pods on that Node stop running
|
||||
- Network partition
|
||||
- Results
|
||||
- partition A thinks the nodes in partition B are down; partition B thinks the apiserver is down. (Assuming the master VM ends up in partition A.)
|
||||
- Kubelet software fault
|
||||
- Results
|
||||
- crashing kubelet cannot start new pods on the node
|
||||
- kubelet might delete the pods or not
|
||||
- node marked unhealthy
|
||||
- replication controllers start new pods elsewhere
|
||||
- Cluster operator error
|
||||
- Results
|
||||
- loss of pods, services, etc
|
||||
- loss of apiserver backing store
|
||||
- users unable to read API
|
||||
- etc.
|
||||
|
||||
Mitigations:
|
||||
- Action: Use IaaS provider's automatic VM restarting feature for IaaS VMs
|
||||
- Mitigates: Apiserver VM shutdown or apiserver crashing
|
||||
- Mitigates: Supporting services VM shutdown or crashes
|
||||
|
||||
- Action: Use IaaS provider's reliable storage (e.g. GCE PD or AWS EBS volume) for VMs with apiserver+etcd
|
||||
- Mitigates: Apiserver backing storage lost
|
||||
|
||||
- Action: Use (experimental) [high-availability](high-availability.html) configuration
|
||||
- Mitigates: Master VM shutdown or master components (scheduler, API server, controller-manager) crashing
|
||||
- Will tolerate one or more simultaneous node or component failures
|
||||
- Mitigates: Apiserver backing storage (i.e., etcd's data directory) lost
|
||||
- Assuming you used clustered etcd.
|
||||
|
||||
- Action: Snapshot apiserver PDs/EBS-volumes periodically
|
||||
- Mitigates: Apiserver backing storage lost
|
||||
- Mitigates: Some cases of operator error
|
||||
- Mitigates: Some cases of Kubernetes software fault
|
||||
|
||||
- Action: use replication controller and services in front of pods
|
||||
- Mitigates: Node shutdown
|
||||
- Mitigates: Kubelet software fault
|
||||
|
||||
- Action: applications (containers) designed to tolerate unexpected restarts
|
||||
- Mitigates: Node shutdown
|
||||
- Mitigates: Kubelet software fault
|
||||
|
||||
- Action: [Multiple independent clusters](multi-cluster.html) (and avoid making risky changes to all clusters at once)
|
||||
- Mitigates: Everything listed above.
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/admin/cluster-troubleshooting.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,18 @@
|
|||
apiVersion: extensions/v1beta1
|
||||
kind: DaemonSet
|
||||
metadata:
|
||||
name: prometheus-node-exporter
|
||||
spec:
|
||||
template:
|
||||
metadata:
|
||||
name: prometheus-node-exporter
|
||||
labels:
|
||||
daemon: prom-node-exp
|
||||
spec:
|
||||
containers:
|
||||
- name: c
|
||||
image: prom/prometheus
|
||||
ports:
|
||||
- containerPort: 9090
|
||||
hostPort: 9090
|
||||
name: serverport
|
|
@ -0,0 +1,210 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "Daemon Sets"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
# Daemon Sets
|
||||
|
||||
**Table of Contents**
|
||||
<!-- BEGIN MUNGE: GENERATED_TOC -->
|
||||
|
||||
- [Daemon Sets](#daemon-sets)
|
||||
- [What is a _Daemon Set_?](#what-is-a-daemon-set)
|
||||
- [Writing a DaemonSet Spec](#writing-a-daemonset-spec)
|
||||
- [Required Fields](#required-fields)
|
||||
- [Pod Template](#pod-template)
|
||||
- [Pod Selector](#pod-selector)
|
||||
- [Running Pods on Only Some Nodes](#running-pods-on-only-some-nodes)
|
||||
- [How Daemon Pods are Scheduled](#how-daemon-pods-are-scheduled)
|
||||
- [Communicating with DaemonSet Pods](#communicating-with-daemonset-pods)
|
||||
- [Updating a DaemonSet](#updating-a-daemonset)
|
||||
- [Alternatives to Daemon Set](#alternatives-to-daemon-set)
|
||||
- [Init Scripts](#init-scripts)
|
||||
- [Bare Pods](#bare-pods)
|
||||
- [Static Pods](#static-pods)
|
||||
- [Replication Controller](#replication-controller)
|
||||
- [Caveats](#caveats)
|
||||
|
||||
<!-- END MUNGE: GENERATED_TOC -->
|
||||
|
||||
## What is a _Daemon Set_?
|
||||
|
||||
A _Daemon Set_ ensures that all (or some) nodes run a copy of a pod. As nodes are added to the
|
||||
cluster, pods are added to them. As nodes are removed from the cluster, those pods are garbage
|
||||
collected. Deleting a Daemon Set will clean up the pods it created.
|
||||
|
||||
Some typical uses of a Daemon Set are:
|
||||
|
||||
- running a cluster storage daemon, such as `glusterd`, `ceph`, on each node.
|
||||
- running a logs collection daemon on every node, such as `fluentd` or `logstash`.
|
||||
- running a node monitoring daemon on every node, such as [Prometheus Node Exporter](
|
||||
https://github.com/prometheus/node_exporter), `collectd`, New Relic agent, or Ganglia `gmond`.
|
||||
|
||||
In a simple case, one Daemon Set, covering all nodes, would be used for each type of daemon.
|
||||
A more complex setup might use multiple DaemonSets for a single type of daemon, but with different flags and/or different memory and CPU requests for different hardware types.
|
||||
|
||||
## Writing a DaemonSet Spec
|
||||
|
||||
### Required Fields
|
||||
|
||||
As with all other Kubernetes config, a DaemonSet needs `apiVersion`, `kind`, and `metadata` fields. For
|
||||
general information about working with config files, see [here](../user-guide/simple-yaml.html),
|
||||
[here](../user-guide/configuring-containers.html), and [here](../user-guide/working-with-resources.html).
|
||||
|
||||
A DaemonSet also needs a [`.spec`](../devel/api-conventions.html#spec-and-status) section.
|
||||
|
||||
### Pod Template
|
||||
|
||||
The `.spec.template` is the only required field of the `.spec`.
|
||||
|
||||
The `.spec.template` is a [pod template](../user-guide/replication-controller.html#pod-template).
|
||||
It has exactly the same schema as a [pod](../user-guide/pods.html), except
|
||||
it is nested and does not have an `apiVersion` or `kind`.
|
||||
|
||||
In addition to required fields for a pod, a pod template in a DaemonSet has to specify appropriate
|
||||
labels (see [pod selector](#pod-selector)).
|
||||
|
||||
A pod template in a DaemonSet must have a [`RestartPolicy`](../user-guide/pod-states.html)
|
||||
equal to `Always`, or be unspecified, which defaults to `Always`.
|
||||
|
||||
### Pod Selector
|
||||
|
||||
The `.spec.selector` field is a pod selector. It works the same as the `.spec.selector` of
|
||||
a [ReplicationController](../user-guide/replication-controller.html) or
|
||||
[Job](../user-guide/jobs.html).
|
||||
|
||||
If the `.spec.selector` is specified, it must equal the `.spec.template.metadata.labels`. If not
|
||||
specified, they default to being equal. A config in which they do not match will be rejected by the API.
|
||||
|
||||
Also you should not normally create any pods whose labels match this selector, either directly, via
|
||||
another DaemonSet, or via another controller such as a ReplicationController. Otherwise, the DaemonSet
|
||||
controller will think that those pods were created by it. Kubernetes will not stop you from doing
|
||||
this. One case where you might want to do this is to manually create a pod with a different value on a node for testing.
|
||||
|
||||
### Running Pods on Only Some Nodes
|
||||
|
||||
If you specify a `.spec.template.spec.nodeSelector`, then the DaemonSet controller will
|
||||
create pods on nodes which match that [node
|
||||
selector](../user-guide/node-selection/README.html).
|
||||
|
||||
If you do not specify a `.spec.template.spec.nodeSelector`, then the DaemonSet controller will
|
||||
create pods on all nodes.
|
||||
|
||||
## How Daemon Pods are Scheduled
|
||||
|
||||
Normally, the machine that a pod runs on is selected by the Kubernetes scheduler. However, pods
|
||||
created by the Daemon controller have the machine already selected (`.spec.nodeName` is specified
|
||||
when the pod is created, so it is ignored by the scheduler). Therefore:
|
||||
|
||||
- the [`unschedulable`](node.html#manual-node-administration) field of a node is not respected
|
||||
by the daemon set controller.
|
||||
- the daemon set controller can make pods even when the scheduler has not been started, which can help cluster
|
||||
bootstrap.
|
||||
|
||||
## Communicating with DaemonSet Pods
|
||||
|
||||
Some possible patterns for communicating with pods in a DaemonSet are:
|
||||
|
||||
- **Push**: Pods in the Daemon Set are configured to send updates to another service, such
|
||||
as a stats database. They do not have clients.
|
||||
- **NodeIP and Known Port**: Pods in the Daemon Set use a `hostPort`, so that the pods are reachable
|
||||
via the node IPs. Clients know the list of node IPs somehow, and know the port by convention.
|
||||
- **DNS**: Create a [headless service](../user-guide/services.html#headless-services) with the same pod selector,
|
||||
and then discover DaemonSets using the `endpoints` resource or retrieve multiple A records from
|
||||
DNS.
|
||||
- **Service**: Create a service with the same pod selector, and use the service to reach a
|
||||
daemon on a random node. (No way to reach specific node.)
|
||||
|
||||
## Updating a DaemonSet
|
||||
|
||||
If node labels are changed, the DaemonSet will promptly add pods to newly matching nodes and delete
|
||||
pods from newly not-matching nodes.
|
||||
|
||||
You can modify the pods that a DaemonSet creates. However, pods do not allow all
|
||||
fields to be updated. Also, the DaemonSet controller will use the original template the next
|
||||
time a node (even with the same name) is created.
|
||||
|
||||
|
||||
You can delete a DaemonSet. If you specify `--cascade=false` with `kubectl`, then the pods
|
||||
will be left on the nodes. You can then create a new DaemonSet with a different template. The new DaemonSet with the different template will recognize all the existing pods as having
|
||||
matching labels. It will not modify or delete them despite a mismatch in the pod template.
|
||||
You will need to force new pod creation by deleting the pod or deleting the node.
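A sketch of that workflow; the DaemonSet name and file name are placeholders, and this assumes your `kubectl` build recognizes the `daemonset` resource:

{% highlight console %}
{% raw %}
$ kubectl delete daemonset <daemonset-name> --cascade=false   # existing pods stay on the nodes
$ kubectl create -f <new-daemonset-template>.yaml
$ kubectl delete pods <pod-name>                              # force re-creation with the new template
{% endraw %}
{% endhighlight %}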
|
||||
|
||||
You cannot update a DaemonSet.
|
||||
|
||||
Support for updating DaemonSets and controlled updating of nodes is planned.
|
||||
|
||||
## Alternatives to Daemon Set
|
||||
|
||||
### Init Scripts
|
||||
|
||||
It is certainly possible to run daemon processes by directly starting them on a node (e.g. using
|
||||
`init`, `upstartd`, or `systemd`). This is perfectly fine. However, there are several advantages to
|
||||
running such processes via a DaemonSet:
|
||||
|
||||
- Ability to monitor and manage logs for daemons in the same way as applications.
|
||||
- Same config language and tools (e.g. pod templates, `kubectl`) for daemons and applications.
|
||||
- Future versions of Kubernetes will likely support integration between DaemonSet-created
|
||||
pods and node upgrade workflows.
|
||||
- Running daemons in containers with resource limits increases the isolation of daemons from app
|
||||
containers. However, this can also be accomplished by running the daemons in a container but not in a pod
|
||||
(e.g. start directly via Docker).
|
||||
|
||||
### Bare Pods
|
||||
|
||||
It is possible to create pods directly which specify a particular node to run on. However,
|
||||
a Daemon Set replaces pods that are deleted or terminated for any reason, such as in the case of
|
||||
node failure or disruptive node maintenance, such as a kernel upgrade. For this reason, you should
|
||||
use a Daemon Set rather than creating individual pods.
|
||||
|
||||
### Static Pods
|
||||
|
||||
It is possible to create pods by writing a file to a certain directory watched by Kubelet. These
|
||||
are called [static pods](static-pods.html).
|
||||
Unlike DaemonSet, static pods cannot be managed with kubectl
|
||||
or other Kubernetes API clients. Static pods do not depend on the apiserver, making them useful
|
||||
in cluster bootstrapping cases. Also, static pods may be deprecated in the future.
|
||||
|
||||
### Replication Controller
|
||||
|
||||
Daemon Sets are similar to [Replication Controllers](../user-guide/replication-controller.html) in that
|
||||
they both create pods, and those pods have processes which are not expected to terminate (e.g. web servers,
|
||||
storage servers).
|
||||
|
||||
Use a replication controller for stateless services, like frontends, where scaling up and down the
|
||||
number of replicas and rolling out updates are more important than controlling exactly which host
|
||||
the pod runs on. Use a Daemon Set when it is important that a copy of a pod always run on
|
||||
all or certain hosts, and when it needs to start before other pods.
|
||||
|
||||
## Caveats
|
||||
|
||||
DaemonSet objects are in the [`extensions` API Group](../api.html#api-groups).
|
||||
DaemonSet is not enabled by default. Enable it by setting
|
||||
`--runtime-config=extensions/v1beta1/daemonsets=true` on the api server. This can be
|
||||
achieved by exporting `ENABLE_DAEMONSETS=true` before running the kube-up.sh script on GCE.
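For example, a sketch of both approaches (flag placement and script paths depend on your installation):

{% highlight sh %}
{% raw %}
# Directly on the apiserver command line (other required flags omitted):
kube-apiserver --runtime-config=extensions/v1beta1/daemonsets=true

# Or, when bringing up a cluster on GCE with kube-up.sh:
export ENABLE_DAEMONSETS=true
cluster/kube-up.sh
{% endraw %}
{% endhighlight %}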
|
||||
|
||||
DaemonSet objects effectively have [API version `v1alpha1`](../api.html#api-versioning).
|
||||
Alpha objects may change or even be discontinued in future software releases.
|
||||
However, due to a known issue, they will appear as API version `v1beta1` if enabled.
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/admin/daemons.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,60 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "DNS Integration with Kubernetes"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
# DNS Integration with Kubernetes
|
||||
|
||||
As of Kubernetes 0.8, DNS is offered as a [cluster add-on](http://releases.k8s.io/release-1.1/cluster/addons/README.md).
|
||||
If enabled, a DNS Pod and Service will be scheduled on the cluster, and the kubelets will be
|
||||
configured to tell individual containers to use the DNS Service's IP to resolve DNS names.
|
||||
|
||||
Every Service defined in the cluster (including the DNS server itself) will be
|
||||
assigned a DNS name. By default, a client Pod's DNS search list will
|
||||
include the Pod's own namespace and the cluster's default domain. This is best
|
||||
illustrated by example:
|
||||
|
||||
Assume a Service named `foo` in the Kubernetes namespace `bar`. A Pod running
|
||||
in namespace `bar` can look up this service by simply doing a DNS query for
|
||||
`foo`. A Pod running in namespace `quux` can look up this service by doing a
|
||||
DNS query for `foo.bar`.
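For example, a quick check from a pod in namespace `quux` (the pod name is a placeholder, and this assumes the pod's image ships `nslookup`):

{% highlight console %}
{% raw %}
$ kubectl exec --namespace=quux <pod-name> -- nslookup foo.bar
{% endraw %}
{% endhighlight %}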
|
||||
|
||||
The cluster DNS server ([SkyDNS](https://github.com/skynetservices/skydns))
|
||||
supports forward lookups (A records) and service lookups (SRV records).
|
||||
|
||||
## How it Works
|
||||
|
||||
The running DNS pod holds 3 containers - skydns, etcd (a private instance which skydns uses),
|
||||
and a Kubernetes-to-skydns bridge called kube2sky. The kube2sky process
|
||||
watches the Kubernetes master for changes in Services, and then writes the
|
||||
information to etcd, which skydns reads. This etcd instance is not linked to
|
||||
any other etcd clusters that might exist, including the Kubernetes master.
|
||||
|
||||
## Issues
|
||||
|
||||
The skydns service is reachable directly from Kubernetes nodes (outside
|
||||
of any container) and DNS resolution works if the skydns service is targeted
|
||||
explicitly. However, nodes are not configured to use the cluster DNS service or
|
||||
to search the cluster's DNS domain by default. This may be resolved at a later
|
||||
time.
|
||||
|
||||
## For more information
|
||||
|
||||
See [the docs for the DNS cluster addon](http://releases.k8s.io/release-1.1/cluster/addons/dns/README.md).
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/admin/dns.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,69 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "etcd"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
# etcd
|
||||
|
||||
[etcd](https://coreos.com/etcd/docs/2.0.12/) is a highly-available key value
|
||||
store which Kubernetes uses for persistent storage of all of its REST API
|
||||
objects.
|
||||
|
||||
## Configuration: high-level goals
|
||||
|
||||
Access Control: give *only* kube-apiserver read/write access to etcd. You do not
|
||||
want apiserver's etcd exposed to every node in your cluster (or worse, to the
|
||||
internet at large), because access to etcd is equivalent to root in your
|
||||
cluster.
|
||||
|
||||
Data Reliability: for reasonable safety, either etcd needs to be run as a
|
||||
[cluster](high-availability.html#clustering-etcd) (multiple machines each running
|
||||
etcd) or etcd's data directory should be located on durable storage (e.g., GCE's
|
||||
persistent disk). In either case, if high availability is required--as it might
|
||||
be in a production cluster--the data directory ought to be [backed up
|
||||
periodically](https://coreos.com/etcd/docs/2.0.12/admin_guide.html#disaster-recovery),
|
||||
to reduce downtime in case of corruption.
|
||||
|
||||
## Default configuration
|
||||
|
||||
The default setup scripts use kubelet's file-based static pods feature to run etcd in a
|
||||
[pod](http://releases.k8s.io/release-1.1/cluster/saltbase/salt/etcd/etcd.manifest). This manifest should only
|
||||
be run on master VMs. The default location that kubelet scans for manifests is
|
||||
`/etc/kubernetes/manifests/`.
|
||||
|
||||
## Kubernetes's usage of etcd
|
||||
|
||||
By default, Kubernetes objects are stored under the `/registry` key in etcd.
|
||||
This path can be prefixed by using the [kube-apiserver](kube-apiserver.html) flag
|
||||
`--etcd-prefix="/foo"`.
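For example, a sketch of inspecting that key space with the etcd v2 client from the master (assumes `etcdctl` is installed and etcd is listening on its default local client port):

{% highlight console %}
{% raw %}
$ etcdctl ls --recursive /registry | head
{% endraw %}
{% endhighlight %}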
|
||||
|
||||
`etcd` is the only place that Kubernetes keeps state.
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
To test whether `etcd` is running correctly, you can try writing a value to a
|
||||
test key. On your master VM (or somewhere with firewalls configured such that
|
||||
you can talk to your cluster's etcd), try:
|
||||
|
||||
{% highlight sh %}
|
||||
{% raw %}
|
||||
curl -fs -X PUT "http://${host}:${port}/v2/keys/_test"
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/admin/etcd.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,93 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "Garbage Collection"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
# Garbage Collection
|
||||
|
||||
- [Introduction](#introduction)
|
||||
- [Image Collection](#image-collection)
|
||||
- [Container Collection](#container-collection)
|
||||
- [User Configuration](#user-configuration)
|
||||
|
||||
### Introduction
|
||||
|
||||
Garbage collection is managed by kubelet automatically, mainly including unreferenced
|
||||
images and dead containers. kubelet applies container garbage collection every minute
|
||||
and image garbage collection every 5 minutes.
|
||||
Note that we generally do not recommend using an external garbage collection tool, since it could break kubelet's behavior by removing containers that kubelet relies on as tombstones. That said, garbage collectors that only address Docker's resource-leaking issues are welcome.
|
||||
|
||||
### Image Collection
|
||||
|
||||
Kubernetes manages the lifecycle of all images through imageManager, with the cooperation
|
||||
of cadvisor.
|
||||
The policy for garbage collecting images we apply takes two factors into consideration,
|
||||
`HighThresholdPercent` and `LowThresholdPercent`. Disk usage above the high threshold
|
||||
will trigger garbage collection, which attempts to delete unused images until the low
|
||||
threshold is met. Least recently used images are deleted first.
|
||||
|
||||
### Container Collection
|
||||
|
||||
The policy for garbage collecting containers takes three variables into account, all of which can
|
||||
be user-defined. `MinAge` is the minimum age at which a container can be garbage collected,
|
||||
zero for no limit. `MaxPerPodContainer` is the max number of dead containers any single
|
||||
pod (UID, container name) pair is allowed to have, less than zero for no limit.
|
||||
`MaxContainers` is the max number of total dead containers, less than zero for no limit as well.
|
||||
|
||||
kubelet removes containers that are unidentified or that fall outside the bounds set by the three flags mentioned above. Generally, the oldest containers are removed first. Because both `MaxPerPodContainer` and `MaxContainers` are taken into consideration, they can conflict: keeping the maximum number of containers per pod may exceed the global limit on dead containers. In that case, kubelet relaxes `MaxPerPodContainer` slightly; in the worst case it downgrades the limit to 1 container per pod and then evicts the oldest containers.
|
||||
|
||||
When kubelet removes the dead containers, all the files inside the container will be cleaned up as well.
|
||||
Note that we will skip the containers that are not managed by kubelet.
|
||||
|
||||
### User Configuration
|
||||
|
||||
Users are free to set their own values to tune image garbage collection.
|
||||
|
||||
1. `image-gc-high-threshold`, the percent of disk usage which triggers image garbage collection.
|
||||
Default is 90%.
|
||||
2. `image-gc-low-threshold`, the percent of disk usage to which image garbage collection attempts
|
||||
to free. Default is 80%.
|
||||
|
||||
We also allow users to customize the container garbage collection policy via the following three flags (see the sketch after this list).
|
||||
|
||||
1. `minimum-container-ttl-duration`, minimum age for a finished container before it is
|
||||
garbage collected. Default is 1 minute.
|
||||
2. `maximum-dead-containers-per-container`, maximum number of old instances to retain
|
||||
per container. Default is 2.
|
||||
3. `maximum-dead-containers`, maximum number of old instances of containers to retain globally.
|
||||
Default is 100.
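A sketch of passing these flags to kubelet; the values shown simply restate the documented defaults, and how kubelet is launched (init script, systemd unit, salt) depends on your installation:

{% highlight sh %}
{% raw %}
# Values shown are the documented defaults; adjust to taste. Other kubelet
# flags required by your installation go on the same command line.
kubelet --image-gc-high-threshold=90 --image-gc-low-threshold=80 \
  --minimum-container-ttl-duration=1m \
  --maximum-dead-containers-per-container=2 \
  --maximum-dead-containers=100
{% endraw %}
{% endhighlight %}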
|
||||
|
||||
Note that we highly recommend setting `maximum-dead-containers-per-container` large enough to retain at least 2 dead containers per expected container when you customize the flag configuration. A generous value for `maximum-dead-containers` is also important, for a similar reason.
|
||||
See [this issue](https://github.com/kubernetes/kubernetes/issues/13287) for more details.
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/admin/garbage-collection.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,280 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "High Availability Kubernetes Clusters"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
# High Availability Kubernetes Clusters
|
||||
|
||||
**Table of Contents**
|
||||
<!-- BEGIN MUNGE: GENERATED_TOC -->
|
||||
|
||||
- [High Availability Kubernetes Clusters](#high-availability-kubernetes-clusters)
|
||||
- [Introduction](#introduction)
|
||||
- [Overview](#overview)
|
||||
- [Initial set-up](#initial-set-up)
|
||||
- [Reliable nodes](#reliable-nodes)
|
||||
- [Establishing a redundant, reliable data storage layer](#establishing-a-redundant-reliable-data-storage-layer)
|
||||
- [Clustering etcd](#clustering-etcd)
|
||||
- [Validating your cluster](#validating-your-cluster)
|
||||
- [Even more reliable storage](#even-more-reliable-storage)
|
||||
- [Replicated API Servers](#replicated-api-servers)
|
||||
- [Installing configuration files](#installing-configuration-files)
|
||||
- [Starting the API Server](#starting-the-api-server)
|
||||
- [Load balancing](#load-balancing)
|
||||
- [Master elected components](#master-elected-components)
|
||||
- [Installing configuration files](#installing-configuration-files)
|
||||
- [Running the podmaster](#running-the-podmaster)
|
||||
- [Conclusion](#conclusion)
|
||||
- [Vagrant up!](#vagrant-up)
|
||||
|
||||
<!-- END MUNGE: GENERATED_TOC -->
|
||||
|
||||
## Introduction
|
||||
|
||||
This document describes how to build a high-availability (HA) Kubernetes cluster. This is a fairly advanced topic.
|
||||
Users who merely want to experiment with Kubernetes are encouraged to use configurations that are simpler to set up such as
|
||||
the simple [Docker based single node cluster instructions](../../docs/getting-started-guides/docker.html),
|
||||
or try [Google Container Engine](https://cloud.google.com/container-engine/) for hosted Kubernetes.
|
||||
|
||||
Also, at this time high availability support for Kubernetes is not continuously tested in our end-to-end (e2e) testing. We will
|
||||
be working to add this continuous testing, but for now the single-node master installations are more heavily tested.
|
||||
|
||||
## Overview
|
||||
|
||||
Setting up a truly reliable, highly available distributed system requires a number of steps; it is akin to wearing underwear, pants, a belt, suspenders, another pair of underwear, and another pair of pants. We go into each
|
||||
of these steps in detail, but a summary is given here to help guide and orient the user.
|
||||
|
||||
The steps involved are as follows:
|
||||
* [Creating the reliable constituent nodes that collectively form our HA master implementation.](#reliable-nodes)
|
||||
* [Setting up a redundant, reliable storage layer with clustered etcd.](#establishing-a-redundant-reliable-data-storage-layer)
|
||||
* [Starting replicated, load balanced Kubernetes API servers](#replicated-api-servers)
|
||||
* [Setting up master-elected Kubernetes scheduler and controller-manager daemons](#master-elected-components)
|
||||
|
||||
Here's what the system should look like when it's finished:
|
||||
![High availability Kubernetes diagram](high-availability/ha.png)
|
||||
|
||||
Ready? Let's get started.
|
||||
|
||||
## Initial set-up
|
||||
|
||||
The remainder of this guide assumes that you are setting up a 3-node clustered master, where each machine is running some flavor of Linux.
|
||||
Examples in the guide are given for Debian distributions, but they should be easily adaptable to other distributions.
|
||||
Likewise, this set up should work whether you are running in a public or private cloud provider, or if you are running
|
||||
on bare metal.
|
||||
|
||||
The easiest way to implement an HA Kubernetes cluster is to start with an existing single-master cluster. The
|
||||
instructions at [https://get.k8s.io](https://get.k8s.io)
|
||||
describe easy installation for single-master clusters on a variety of platforms.
|
||||
|
||||
## Reliable nodes
|
||||
|
||||
On each master node, we are going to run a number of processes that implement the Kubernetes API. The first step in making these reliable is
|
||||
to make sure that each automatically restarts when it fails. To achieve this, we need to install a process watcher. We choose to use
|
||||
the `kubelet` that we run on each of the worker nodes. This is convenient, since we can use containers to distribute our binaries, we can
|
||||
establish resource limits, and introspect the resource usage of each daemon. Of course, we also need something to monitor the kubelet
|
||||
itself (insert who watches the watcher jokes here). For Debian systems, we choose monit, but there are a number of alternate
|
||||
choices. For example, on systemd-based systems (e.g. RHEL, CentOS), you can run `systemctl enable kubelet`.
|
||||
|
||||
If you are extending from a standard Kubernetes installation, the `kubelet` binary should already be present on your system. You can run
|
||||
`which kubelet` to determine if the binary is in fact installed. If it is not installed,
|
||||
you should install the [kubelet binary](https://storage.googleapis.com/kubernetes-release/release/v0.19.3/bin/linux/amd64/kubelet), the
|
||||
[kubelet init file](http://releases.k8s.io/release-1.1/cluster/saltbase/salt/kubelet/initd) and [high-availability/default-kubelet](high-availability/default-kubelet)
|
||||
scripts.
|
||||
|
||||
If you are using monit, you should also install the monit daemon (`apt-get install monit`) and the [high-availability/monit-kubelet](high-availability/monit-kubelet) and
|
||||
[high-availability/monit-docker](high-availability/monit-docker) configs.
|
||||
|
||||
On systemd systems, run `systemctl enable kubelet` and `systemctl enable docker`.
|
||||
|
||||
|
||||
## Establishing a redundant, reliable data storage layer
|
||||
|
||||
The central foundation of a highly available solution is a redundant, reliable storage layer. The number one rule of high-availability is
|
||||
to protect the data. Whatever else happens, whatever catches on fire, if you have the data, you can rebuild. If you lose the data, you're
|
||||
done.
|
||||
|
||||
Clustered etcd already replicates your storage to all master instances in your cluster. This means that to lose data, all three nodes would need
|
||||
to have their physical (or virtual) disks fail at the same time. The probability that this occurs is relatively low, so for many people
|
||||
running a replicated etcd cluster is likely reliable enough. You can add additional reliability by increasing the
|
||||
size of the cluster from three to five nodes. If that is still insufficient, you can add
|
||||
[even more redundancy to your storage layer](#even-more-reliable-storage).
|
||||
|
||||
### Clustering etcd
|
||||
|
||||
The full details of clustering etcd are beyond the scope of this document; many details are given on the [etcd clustering page](https://github.com/coreos/etcd/blob/master/Documentation/clustering.md). This example walks through
|
||||
a simple cluster setup, using etcd's built-in discovery to build our cluster.
|
||||
|
||||
First, hit the etcd discovery service to create a new token:
|
||||
|
||||
{% highlight sh %}
|
||||
{% raw %}
|
||||
curl https://discovery.etcd.io/new?size=3
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
On each node, copy the [etcd.yaml](high-availability/etcd.yaml) file into `/etc/kubernetes/manifests/etcd.yaml`.
|
||||
|
||||
The kubelet on each node actively monitors the contents of that directory, and it will create an instance of the `etcd`
|
||||
server from the definition of the pod specified in `etcd.yaml`.
|
||||
|
||||
Note that in `etcd.yaml` you should substitute the token URL you got above for `${DISCOVERY_TOKEN}` on all three machines,
|
||||
and you should substitute a different name (e.g. `node-1`) for `${NODE_NAME}` and the correct IP address
|
||||
for `${NODE_IP}` on each machine.
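For example, a sketch of filling in the placeholders on one node (the token URL, node name, and IP below are examples only):

{% highlight sh %}
{% raw %}
DISCOVERY_TOKEN="https://discovery.etcd.io/<token-from-the-curl-above>"
NODE_NAME="node-1"
NODE_IP="10.240.0.2"
# Substitute the placeholders in etcd.yaml and install it for the kubelet.
sed -e "s|\${DISCOVERY_TOKEN}|${DISCOVERY_TOKEN}|g" \
    -e "s|\${NODE_NAME}|${NODE_NAME}|g" \
    -e "s|\${NODE_IP}|${NODE_IP}|g" \
    etcd.yaml > /etc/kubernetes/manifests/etcd.yaml
{% endraw %}
{% endhighlight %}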
|
||||
|
||||
|
||||
#### Validating your cluster
|
||||
|
||||
Once you copy this into all three nodes, you should have a clustered etcd set up. You can validate with
|
||||
|
||||
{% highlight sh %}
|
||||
{% raw %}
|
||||
etcdctl member list
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
and
|
||||
|
||||
{% highlight sh %}
|
||||
{% raw %}
|
||||
etcdctl cluster-health
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
You can also validate that this is working with `etcdctl set foo bar` on one node, and `etcdctl get foo`
|
||||
on a different node.
|
||||
|
||||
### Even more reliable storage
|
||||
|
||||
Of course, if you are interested in increased data reliability, there are further options which make the place where etcd stores its data even more reliable than regular disks (belts *and* suspenders, ftw!).
|
||||
|
||||
If you use a cloud provider, then they usually provide this
|
||||
for you, for example [Persistent Disk](https://cloud.google.com/compute/docs/disks/persistent-disks) on the Google Cloud Platform. These provide block-device persistent storage that can be mounted onto your virtual machine. Other cloud providers offer similar solutions.
|
||||
|
||||
If you are running on physical machines, you can also use network attached redundant storage using an iSCSI or NFS interface.
|
||||
Alternatively, you can run a clustered file system like Gluster or Ceph. Finally, you can also run a RAID array on each physical machine.
|
||||
|
||||
Regardless of how you choose to implement it, if you choose to use one of these options, you should make sure that your storage is mounted
|
||||
to each machine. If your storage is shared between the three masters in your cluster, you should create a different directory on the storage
|
||||
for each node. Throughout these instructions, we assume that this storage is mounted to your machine in `/var/etcd/data`.
|
||||
|
||||
|
||||
## Replicated API Servers
|
||||
|
||||
Once you have replicated etcd set up correctly, we will also install the apiserver using the kubelet.
|
||||
|
||||
### Installing configuration files
|
||||
|
||||
First you need to create the initial log file, so that Docker mounts a file instead of a directory:
|
||||
|
||||
{% highlight sh %}
|
||||
{% raw %}
|
||||
touch /var/log/kube-apiserver.log
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
Next, you need to create a `/srv/kubernetes/` directory on each node. This directory includes:
|
||||
* basic_auth.csv - basic auth user and password
|
||||
* ca.crt - Certificate Authority cert
|
||||
* known_tokens.csv - tokens that entities (e.g. the kubelet) can use to talk to the apiserver
|
||||
* kubecfg.crt - Client certificate, public key
|
||||
* kubecfg.key - Client certificate, private key
|
||||
* server.cert - Server certificate, public key
|
||||
* server.key - Server certificate, private key
|
||||
|
||||
The easiest way to create this directory may be to copy it from the master node of a working cluster; alternatively, you can generate these files yourself.
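For example, a sketch of copying it from an existing master (the hostname is a placeholder, and this assumes SSH access and appropriate permissions):

{% highlight sh %}
{% raw %}
scp -r <existing-master>:/srv/kubernetes /srv/
{% endraw %}
{% endhighlight %}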
|
||||
|
||||
### Starting the API Server
|
||||
|
||||
Once these files exist, copy the [kube-apiserver.yaml](high-availability/kube-apiserver.yaml) into `/etc/kubernetes/manifests/` on each master node.
|
||||
|
||||
The kubelet monitors this directory, and will automatically create an instance of the `kube-apiserver` container using the pod definition specified
|
||||
in the file.
|
||||
|
||||
### Load balancing
|
||||
|
||||
At this point, you should have 3 apiservers all working correctly. If you set up a network load balancer, you should
|
||||
be able to access your cluster via that load balancer, and see traffic balancing between the apiserver instances. Setting
|
||||
up a load balancer will depend on the specifics of your platform; for example, instructions for the Google Cloud Platform can be found [here](https://cloud.google.com/compute/docs/load-balancing/).
|
||||
|
||||
Note, if you are using authentication, you may need to regenerate your certificate to include the IP address of the balancer,
|
||||
in addition to the IP addresses of the individual nodes.
|
||||
|
||||
For pods that you deploy into the cluster, the `kubernetes` service/dns name should provide a load balanced endpoint for the master automatically.
|
||||
|
||||
For external users of the API (e.g. the `kubectl` command line interface, continuous build pipelines, or other clients) you will want to configure
|
||||
them to talk to the external load balancer's IP address.
|
||||
|
||||
## Master elected components
|
||||
|
||||
So far we have set up state storage, and we have set up the API server, but we haven't run anything that actually modifies
|
||||
cluster state, such as the controller manager and scheduler. To achieve this reliably, we only want to have one actor modifying state at a time, but we want replicated
|
||||
instances of these actors, in case a machine dies. To achieve this, we are going to use a lease-lock in etcd to perform
|
||||
master election. On each of the three apiserver nodes, we run a small utility application named `podmaster`. Its job is to implement a master election protocol using etcd "compare and swap". If the apiserver node wins the election, it starts the master component it is managing (e.g. the scheduler); if it loses the election, it ensures that any master components running on the node (e.g. the scheduler) are stopped.
|
||||
|
||||
In the future, we expect to more tightly integrate this lease-locking into the scheduler and controller-manager binaries directly, as described in the [high availability design proposal](../proposals/high-availability.html).
|
||||
|
||||
### Installing configuration files
|
||||
|
||||
First, create empty log files on each node, so that Docker will mount the files instead of creating new directories:
|
||||
|
||||
{% highlight sh %}
|
||||
{% raw %}
|
||||
touch /var/log/kube-scheduler.log
|
||||
touch /var/log/kube-controller-manager.log
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
Next, set up the descriptions of the scheduler and controller manager pods on each node by copying [kube-scheduler.yaml](high-availability/kube-scheduler.yaml) and [kube-controller-manager.yaml](high-availability/kube-controller-manager.yaml) into the `/srv/kubernetes/` directory.
|
||||
|
||||
### Running the podmaster
|
||||
|
||||
Now that the configuration files are in place, copy the [podmaster.yaml](high-availability/podmaster.yaml) config file into `/etc/kubernetes/manifests/`.
|
||||
|
||||
As before, the kubelet on the node monitors this directory, and will start an instance of the podmaster using the pod specification provided in `podmaster.yaml`.
|
||||
|
||||
Now you will have one instance of the scheduler process running on a single master node, and likewise one
|
||||
controller-manager process running on a single (possibly different) master node. If either of these processes fails,
|
||||
the kubelet will restart them. If any of these nodes fail, the process will move to a different instance of a master
|
||||
node.
|
||||
|
||||
## Conclusion
|
||||
|
||||
At this point, you are done (yeah!) with the master components, but you still need to add worker nodes (boo!).
|
||||
|
||||
If you have an existing cluster, this is as simple as reconfiguring your kubelets to talk to the load-balanced endpoint, and
|
||||
restarting the kubelets on each node.
|
||||
|
||||
If you are turning up a fresh cluster, you will need to install the kubelet and kube-proxy on each worker node, and
|
||||
set the `--api-servers` flag to your replicated endpoint.
|
||||
|
||||
## Vagrant up!
|
||||
|
||||
We indeed have an initial proof of concept tester for this, which is available [here](https://releases.k8s.io/release-1.1/examples/high-availability).
|
||||
|
||||
It implements the major concepts (with a few minor reductions for simplicity) of the podmaster HA implementation, alongside a quick smoke test using k8petstore.
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/admin/high-availability.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,8 @@
|
|||
# This should be the IP address of the load balancer for all masters
|
||||
MASTER_IP=<insert-ip-here>
|
||||
# This should be the internal service IP address reserved for DNS
|
||||
DNS_IP=<insert-dns-ip-here>
|
||||
|
||||
DAEMON_ARGS="$DAEMON_ARGS --api-servers=https://${MASTER_IP} --enable-debugging-handlers=true --cloud-provider=
|
||||
gce --config=/etc/kubernetes/manifests --allow-privileged=False --v=2 --cluster-dns=${DNS_IP} --cluster-domain=c
|
||||
luster.local --configure-cbr0=true --cgroup-root=/ --system-container=/system "
|
|
@ -0,0 +1,87 @@
|
|||
apiVersion: v1
|
||||
kind: Pod
|
||||
metadata:
|
||||
name: etcd-server
|
||||
spec:
|
||||
hostNetwork: true
|
||||
containers:
|
||||
- image: gcr.io/google_containers/etcd:2.0.9
|
||||
name: etcd-container
|
||||
command:
|
||||
- /usr/local/bin/etcd
|
||||
- --name
|
||||
- ${NODE_NAME}
|
||||
- --initial-advertise-peer-urls
|
||||
- http://${NODE_IP}:2380
|
||||
- --listen-peer-urls
|
||||
- http://${NODE_IP}:2380
|
||||
- --advertise-client-urls
|
||||
- http://${NODE_IP}:4001
|
||||
- --listen-client-urls
|
||||
- http://127.0.0.1:4001
|
||||
- --data-dir
|
||||
- /var/etcd/data
|
||||
- --discovery
|
||||
- ${DISCOVERY_TOKEN}
|
||||
ports:
|
||||
- containerPort: 2380
|
||||
hostPort: 2380
|
||||
name: serverport
|
||||
- containerPort: 4001
|
||||
hostPort: 4001
|
||||
name: clientport
|
||||
volumeMounts:
|
||||
- mountPath: /var/etcd
|
||||
name: varetcd
|
||||
- mountPath: /etc/ssl
|
||||
name: etcssl
|
||||
readOnly: true
|
||||
- mountPath: /usr/share/ssl
|
||||
name: usrsharessl
|
||||
readOnly: true
|
||||
- mountPath: /var/ssl
|
||||
name: varssl
|
||||
readOnly: true
|
||||
- mountPath: /usr/ssl
|
||||
name: usrssl
|
||||
readOnly: true
|
||||
- mountPath: /usr/lib/ssl
|
||||
name: usrlibssl
|
||||
readOnly: true
|
||||
- mountPath: /usr/local/openssl
|
||||
name: usrlocalopenssl
|
||||
readOnly: true
|
||||
- mountPath: /etc/openssl
|
||||
name: etcopenssl
|
||||
readOnly: true
|
||||
- mountPath: /etc/pki/tls
|
||||
name: etcpkitls
|
||||
readOnly: true
|
||||
volumes:
|
||||
- hostPath:
|
||||
path: /var/etcd/data
|
||||
name: varetcd
|
||||
- hostPath:
|
||||
path: /etc/ssl
|
||||
name: etcssl
|
||||
- hostPath:
|
||||
path: /usr/share/ssl
|
||||
name: usrsharessl
|
||||
- hostPath:
|
||||
path: /var/ssl
|
||||
name: varssl
|
||||
- hostPath:
|
||||
path: /usr/ssl
|
||||
name: usrssl
|
||||
- hostPath:
|
||||
path: /usr/lib/ssl
|
||||
name: usrlibssl
|
||||
- hostPath:
|
||||
path: /usr/local/openssl
|
||||
name: usrlocalopenssl
|
||||
- hostPath:
|
||||
path: /etc/openssl
|
||||
name: etcopenssl
|
||||
- hostPath:
|
||||
path: /etc/pki/tls
|
||||
name: etcpkitls
|
|
@ -0,0 +1,90 @@
|
|||
apiVersion: v1
|
||||
kind: Pod
|
||||
metadata:
|
||||
name: kube-apiserver
|
||||
spec:
|
||||
hostNetwork: true
|
||||
containers:
|
||||
- name: kube-apiserver
|
||||
image: gcr.io/google_containers/kube-apiserver:9680e782e08a1a1c94c656190011bd02
|
||||
command:
|
||||
- /bin/sh
|
||||
- -c
|
||||
- /usr/local/bin/kube-apiserver --address=127.0.0.1 --etcd-servers=http://127.0.0.1:4001
|
||||
--cloud-provider=gce --admission-control=NamespaceLifecycle,LimitRanger,SecurityContextDeny,ServiceAccount,ResourceQuota
|
||||
--service-cluster-ip-range=10.0.0.0/16 --client-ca-file=/srv/kubernetes/ca.crt
|
||||
--basic-auth-file=/srv/kubernetes/basic_auth.csv --cluster-name=e2e-test-bburns
|
||||
--tls-cert-file=/srv/kubernetes/server.cert --tls-private-key-file=/srv/kubernetes/server.key
|
||||
--secure-port=443 --token-auth-file=/srv/kubernetes/known_tokens.csv --v=2
|
||||
--allow-privileged=False 1>>/var/log/kube-apiserver.log 2>&1
|
||||
ports:
|
||||
- containerPort: 443
|
||||
hostPort: 443
|
||||
name: https
|
||||
- containerPort: 7080
|
||||
hostPort: 7080
|
||||
name: http
|
||||
- containerPort: 8080
|
||||
hostPort: 8080
|
||||
name: local
|
||||
volumeMounts:
|
||||
- mountPath: /srv/kubernetes
|
||||
name: srvkube
|
||||
readOnly: true
|
||||
- mountPath: /var/log/kube-apiserver.log
|
||||
name: logfile
|
||||
- mountPath: /etc/ssl
|
||||
name: etcssl
|
||||
readOnly: true
|
||||
- mountPath: /usr/share/ssl
|
||||
name: usrsharessl
|
||||
readOnly: true
|
||||
- mountPath: /var/ssl
|
||||
name: varssl
|
||||
readOnly: true
|
||||
- mountPath: /usr/ssl
|
||||
name: usrssl
|
||||
readOnly: true
|
||||
- mountPath: /usr/lib/ssl
|
||||
name: usrlibssl
|
||||
readOnly: true
|
||||
- mountPath: /usr/local/openssl
|
||||
name: usrlocalopenssl
|
||||
readOnly: true
|
||||
- mountPath: /etc/openssl
|
||||
name: etcopenssl
|
||||
readOnly: true
|
||||
- mountPath: /etc/pki/tls
|
||||
name: etcpkitls
|
||||
readOnly: true
|
||||
volumes:
|
||||
- hostPath:
|
||||
path: /srv/kubernetes
|
||||
name: srvkube
|
||||
- hostPath:
|
||||
path: /var/log/kube-apiserver.log
|
||||
name: logfile
|
||||
- hostPath:
|
||||
path: /etc/ssl
|
||||
name: etcssl
|
||||
- hostPath:
|
||||
path: /usr/share/ssl
|
||||
name: usrsharessl
|
||||
- hostPath:
|
||||
path: /var/ssl
|
||||
name: varssl
|
||||
- hostPath:
|
||||
path: /usr/ssl
|
||||
name: usrssl
|
||||
- hostPath:
|
||||
path: /usr/lib/ssl
|
||||
name: usrlibssl
|
||||
- hostPath:
|
||||
path: /usr/local/openssl
|
||||
name: usrlocalopenssl
|
||||
- hostPath:
|
||||
path: /etc/openssl
|
||||
name: etcopenssl
|
||||
- hostPath:
|
||||
path: /etc/pki/tls
|
||||
name: etcpkitls
|
|
@ -0,0 +1,82 @@
|
|||
apiVersion: v1
|
||||
kind: Pod
|
||||
metadata:
|
||||
name: kube-controller-manager
|
||||
spec:
|
||||
containers:
|
||||
- command:
|
||||
- /bin/sh
|
||||
- -c
|
||||
- /usr/local/bin/kube-controller-manager --master=127.0.0.1:8080 --cluster-name=e2e-test-bburns
|
||||
--cluster-cidr=10.245.0.0/16 --allocate-node-cidrs=true --cloud-provider=gce --service-account-private-key-file=/srv/kubernetes/server.key
|
||||
--v=2 1>>/var/log/kube-controller-manager.log 2>&1
|
||||
image: gcr.io/google_containers/kube-controller-manager:fda24638d51a48baa13c35337fcd4793
|
||||
livenessProbe:
|
||||
httpGet:
|
||||
path: /healthz
|
||||
port: 10252
|
||||
initialDelaySeconds: 15
|
||||
timeoutSeconds: 1
|
||||
name: kube-controller-manager
|
||||
volumeMounts:
|
||||
- mountPath: /srv/kubernetes
|
||||
name: srvkube
|
||||
readOnly: true
|
||||
- mountPath: /var/log/kube-controller-manager.log
|
||||
name: logfile
|
||||
- mountPath: /etc/ssl
|
||||
name: etcssl
|
||||
readOnly: true
|
||||
- mountPath: /usr/share/ssl
|
||||
name: usrsharessl
|
||||
readOnly: true
|
||||
- mountPath: /var/ssl
|
||||
name: varssl
|
||||
readOnly: true
|
||||
- mountPath: /usr/ssl
|
||||
name: usrssl
|
||||
readOnly: true
|
||||
- mountPath: /usr/lib/ssl
|
||||
name: usrlibssl
|
||||
readOnly: true
|
||||
- mountPath: /usr/local/openssl
|
||||
name: usrlocalopenssl
|
||||
readOnly: true
|
||||
- mountPath: /etc/openssl
|
||||
name: etcopenssl
|
||||
readOnly: true
|
||||
- mountPath: /etc/pki/tls
|
||||
name: etcpkitls
|
||||
readOnly: true
|
||||
hostNetwork: true
|
||||
volumes:
|
||||
- hostPath:
|
||||
path: /srv/kubernetes
|
||||
name: srvkube
|
||||
- hostPath:
|
||||
path: /var/log/kube-controller-manager.log
|
||||
name: logfile
|
||||
- hostPath:
|
||||
path: /etc/ssl
|
||||
name: etcssl
|
||||
- hostPath:
|
||||
path: /usr/share/ssl
|
||||
name: usrsharessl
|
||||
- hostPath:
|
||||
path: /var/ssl
|
||||
name: varssl
|
||||
- hostPath:
|
||||
path: /usr/ssl
|
||||
name: usrssl
|
||||
- hostPath:
|
||||
path: /usr/lib/ssl
|
||||
name: usrlibssl
|
||||
- hostPath:
|
||||
path: /usr/local/openssl
|
||||
name: usrlocalopenssl
|
||||
- hostPath:
|
||||
path: /etc/openssl
|
||||
name: etcopenssl
|
||||
- hostPath:
|
||||
path: /etc/pki/tls
|
||||
name: etcpkitls
|
|
@ -0,0 +1,30 @@
|
|||
apiVersion: v1
|
||||
kind: Pod
|
||||
metadata:
|
||||
name: kube-scheduler
|
||||
spec:
|
||||
hostNetwork: true
|
||||
containers:
|
||||
- name: kube-scheduler
|
||||
image: gcr.io/google_containers/kube-scheduler:34d0b8f8b31e27937327961528739bc9
|
||||
command:
|
||||
- /bin/sh
|
||||
- -c
|
||||
- /usr/local/bin/kube-scheduler --master=127.0.0.1:8080 --v=2 1>>/var/log/kube-scheduler.log
|
||||
2>&1
|
||||
livenessProbe:
|
||||
httpGet:
|
||||
path: /healthz
|
||||
port: 10251
|
||||
initialDelaySeconds: 15
|
||||
timeoutSeconds: 1
|
||||
volumeMounts:
|
||||
- mountPath: /var/log/kube-scheduler.log
|
||||
name: logfile
|
||||
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
|
||||
name: default-token-s8ejd
|
||||
readOnly: true
|
||||
volumes:
|
||||
- hostPath:
|
||||
path: /var/log/kube-scheduler.log
|
||||
name: logfile
|
|
@ -0,0 +1,9 @@
|
|||
check process docker with pidfile /var/run/docker.pid
|
||||
group docker
|
||||
start program = "/etc/init.d/docker start"
|
||||
stop program = "/etc/init.d/docker stop"
|
||||
if does not exist then restart
|
||||
if failed
|
||||
unixsocket /var/run/docker.sock
|
||||
protocol HTTP request "/version"
|
||||
then restart
|
|
@ -0,0 +1,11 @@
|
|||
check process kubelet with pidfile /var/run/kubelet.pid
|
||||
group kubelet
|
||||
start program = "/etc/init.d/kubelet start"
|
||||
stop program = "/etc/init.d/kubelet stop"
|
||||
if does not exist then restart
|
||||
if failed
|
||||
host 127.0.0.1
|
||||
port 10255
|
||||
protocol HTTP
|
||||
request "/healthz"
|
||||
then restart
|
|
@ -0,0 +1,43 @@
|
|||
apiVersion: v1
|
||||
kind: Pod
|
||||
metadata:
|
||||
name: scheduler-master
|
||||
spec:
|
||||
hostNetwork: true
|
||||
containers:
|
||||
- name: scheduler-elector
|
||||
image: gcr.io/google_containers/podmaster:1.1
|
||||
command:
|
||||
- /podmaster
|
||||
- --etcd-servers=http://127.0.0.1:4001
|
||||
- --key=scheduler
|
||||
- --source-file=/kubernetes/kube-scheduler.manifest
|
||||
- --dest-file=/manifests/kube-scheduler.manifest
|
||||
volumeMounts:
|
||||
- mountPath: /kubernetes
|
||||
name: k8s
|
||||
readOnly: true
|
||||
- mountPath: /manifests
|
||||
name: manifests
|
||||
- name: controller-manager-elector
|
||||
image: gcr.io/google_containers/podmaster:1.1
|
||||
command:
|
||||
- /podmaster
|
||||
- --etcd-servers=http://127.0.0.1:4001
|
||||
- --key=controller
|
||||
- --source-file=/kubernetes/kube-controller-manager.manifest
|
||||
- --dest-file=/manifests/kube-controller-manager.manifest
|
||||
terminationMessagePath: /dev/termination-log
|
||||
volumeMounts:
|
||||
- mountPath: /kubernetes
|
||||
name: k8s
|
||||
readOnly: true
|
||||
- mountPath: /manifests
|
||||
name: manifests
|
||||
volumes:
|
||||
- hostPath:
|
||||
path: /srv/kubernetes
|
||||
name: k8s
|
||||
- hostPath:
|
||||
path: /etc/kubernetes/manifests
|
||||
name: manifests
|
|
@ -0,0 +1,58 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "Kubernetes Cluster Admin Guide"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
# Kubernetes Cluster Admin Guide
|
||||
|
||||
The cluster admin guide is for anyone creating or administering a Kubernetes cluster.
|
||||
It assumes some familiarity with concepts in the [User Guide](../user-guide/README.html).
|
||||
|
||||
## Admin Guide Table of Contents
|
||||
|
||||
[Introduction](introduction.html)
|
||||
|
||||
1. [Components of a cluster](cluster-components.html)
|
||||
1. [Cluster Management](cluster-management.html)
|
||||
1. Administrating Master Components
|
||||
1. [The kube-apiserver binary](kube-apiserver.html)
|
||||
1. [Authorization](authorization.html)
|
||||
1. [Authentication](authentication.html)
|
||||
1. [Accessing the api](accessing-the-api.html)
|
||||
1. [Admission Controllers](admission-controllers.html)
|
||||
1. [Administrating Service Accounts](service-accounts-admin.html)
|
||||
1. [Resource Quotas](resource-quota.html)
|
||||
1. [The kube-scheduler binary](kube-scheduler.html)
|
||||
1. [The kube-controller-manager binary](kube-controller-manager.html)
|
||||
1. [Administrating Kubernetes Nodes](node.html)
|
||||
1. [The kubelet binary](kubelet.html)
|
||||
1. [Garbage Collection](garbage-collection.html)
|
||||
1. [The kube-proxy binary](kube-proxy.html)
|
||||
1. Administrating Addons
|
||||
1. [DNS](dns.html)
|
||||
1. [Networking](networking.html)
|
||||
1. [OVS Networking](ovs-networking.html)
|
||||
1. Example Configurations
|
||||
1. [Multiple Clusters](multi-cluster.html)
|
||||
1. [High Availability Clusters](high-availability.html)
|
||||
1. [Large Clusters](cluster-large.html)
|
||||
1. [Getting started from scratch](../getting-started-guides/scratch.html)
|
||||
1. [Kubernetes's use of salt](salt.html)
|
||||
1. [Troubleshooting](cluster-troubleshooting.html)
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/admin/README.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,96 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "Kubernetes Cluster Admin Guide"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
# Kubernetes Cluster Admin Guide
|
||||
|
||||
The cluster admin guide is for anyone creating or administering a Kubernetes cluster.
|
||||
It assumes some familiarity with concepts in the [User Guide](../user-guide/README.html).
|
||||
|
||||
## Planning a cluster
|
||||
|
||||
There are many different examples of how to set up a Kubernetes cluster. Many of them are listed in this
|
||||
[matrix](../getting-started-guides/README.html). We call each of the combinations in this matrix a *distro*.
|
||||
|
||||
Before choosing a particular guide, here are some things to consider:
|
||||
|
||||
- Are you just looking to try out Kubernetes on your laptop, or build a high-availability many-node cluster? Both
|
||||
models are supported, but some distros are better for one case or the other.
|
||||
- Will you be using a hosted Kubernetes cluster, such as [GKE](https://cloud.google.com/container-engine), or setting
|
||||
one up yourself?
|
||||
- Will your cluster be on-premises, or in the cloud (IaaS)? Kubernetes does not directly support hybrid clusters. We
|
||||
recommend setting up multiple clusters rather than spanning distant locations.
|
||||
- Will you be running Kubernetes on "bare metal" or virtual machines? Kubernetes supports both, via different distros.
|
||||
- Do you just want to run a cluster, or do you expect to do active development of Kubernetes project code? If the
|
||||
latter, it is better to pick a distro actively used by other developers. Some distros only use binary releases, but
|
||||
offer a greater variety of choices.
|
||||
- Not all distros are maintained as actively. Prefer ones which are listed as tested on a more recent version of
|
||||
Kubernetes.
|
||||
- If you are configuring Kubernetes on-premises, you will need to consider what [networking
|
||||
model](networking.html) fits best.
|
||||
- If you are designing for very high availability, you may want [clusters in multiple zones](multi-cluster.html).
|
||||
- You may want to familiarize yourself with the various
|
||||
[components](cluster-components.html) needed to run a cluster.
|
||||
|
||||
## Setting up a cluster
|
||||
|
||||
Pick one of the Getting Started Guides from the [matrix](../getting-started-guides/README.html) and follow it.
|
||||
If none of the Getting Started Guides fits, you may want to pull ideas from several of the guides.
|
||||
|
||||
One option for custom networking is *OpenVSwitch GRE/VxLAN networking* ([ovs-networking.md](ovs-networking.html)), which
|
||||
uses OpenVSwitch to set up networking between pods across
|
||||
Kubernetes nodes.
|
||||
|
||||
If you are modifying an existing guide which uses Salt, this document explains [how Salt is used in the Kubernetes
|
||||
project](salt.html).
|
||||
|
||||
## Managing a cluster, including upgrades
|
||||
|
||||
[Managing a cluster](cluster-management.html).
|
||||
|
||||
## Managing nodes
|
||||
|
||||
[Managing nodes](node.html).
|
||||
|
||||
## Optional Cluster Services
|
||||
|
||||
* **DNS Integration with SkyDNS** ([dns.md](dns.html)):
|
||||
Resolving a DNS name directly to a Kubernetes service.
|
||||
|
||||
* **Logging** with [Kibana](../user-guide/logging.html)
|
||||
|
||||
## Multi-tenant support
|
||||
|
||||
* **Resource Quota** ([resource-quota.md](resource-quota.html))
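
A `ResourceQuota` object caps aggregate resource usage per namespace. A minimal sketch (the name,
namespace, and values below are illustrative, not taken from the linked document):

{% highlight yaml %}
{% raw %}
apiVersion: v1
kind: ResourceQuota
metadata:
  name: quota               # illustrative name
  namespace: development    # quotas are defined per namespace
spec:
  hard:
    cpu: "20"               # aggregate CPU allowed in the namespace
    memory: 10Gi            # aggregate memory allowed in the namespace
    pods: "10"              # at most 10 pods may exist in the namespace
{% endraw %}
{% endhighlight %}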
|
||||
|
||||
## Security
|
||||
|
||||
* **Kubernetes Container Environment** ([docs/user-guide/container-environment.md](../user-guide/container-environment.html)):
|
||||
Describes the environment for Kubelet managed containers on a Kubernetes
|
||||
node.
|
||||
|
||||
* **Securing access to the API Server** [accessing the api](accessing-the-api.html)
|
||||
|
||||
* **Authentication** [authentication](authentication.html)
|
||||
|
||||
* **Authorization** [authorization](authorization.html)
|
||||
|
||||
* **Admission Controllers** [admission_controllers](admission-controllers.html)
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/admin/introduction.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,102 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "kube-apiserver"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
## kube-apiserver
|
||||
|
||||
|
||||
|
||||
### Synopsis
|
||||
|
||||
|
||||
The Kubernetes API server validates and configures data
|
||||
for the api objects which include pods, services, replicationcontrollers, and
|
||||
others. The API Server services REST operations and provides the frontend to the
|
||||
cluster's shared state through which all other components interact.
|
||||
|
||||
```
|
||||
{% raw %}
|
||||
kube-apiserver
|
||||
{% endraw %}
|
||||
```
|
||||
|
||||
### Options
|
||||
|
||||
```
|
||||
{% raw %}
|
||||
--admission-control="AlwaysAdmit": Ordered list of plug-ins to do admission control of resources into cluster. Comma-delimited list of: AlwaysAdmit, AlwaysDeny, DenyEscalatingExec, DenyExecOnPrivileged, InitialResources, LimitRanger, NamespaceAutoProvision, NamespaceExists, NamespaceLifecycle, ResourceQuota, SecurityContextDeny, ServiceAccount
|
||||
--admission-control-config-file="": File with admission control configuration.
|
||||
--advertise-address=<nil>: The IP address on which to advertise the apiserver to members of the cluster. This address must be reachable by the rest of the cluster. If blank, the --bind-address will be used. If --bind-address is unspecified, the host's default interface will be used.
|
||||
--allow-privileged[=false]: If true, allow privileged containers.
|
||||
--authorization-mode="AlwaysAllow": Ordered list of plug-ins to do authorization on secure port. Comma-delimited list of: AlwaysAllow,AlwaysDeny,ABAC
|
||||
--authorization-policy-file="": File with authorization policy in csv format, used with --authorization-mode=ABAC, on the secure port.
|
||||
--basic-auth-file="": If set, the file that will be used to admit requests to the secure port of the API server via http basic authentication.
|
||||
--bind-address=0.0.0.0: The IP address on which to serve the --read-only-port and --secure-port ports. The associated interface(s) must be reachable by the rest of the cluster, and by CLI/web clients. If blank, all interfaces will be used (0.0.0.0).
|
||||
--cert-dir="/var/run/kubernetes": The directory where the TLS certs are located (by default /var/run/kubernetes). If --tls-cert-file and --tls-private-key-file are provided, this flag will be ignored.
|
||||
--client-ca-file="": If set, any request presenting a client certificate signed by one of the authorities in the client-ca-file is authenticated with an identity corresponding to the CommonName of the client certificate.
|
||||
--cloud-config="": The path to the cloud provider configuration file. Empty string for no configuration file.
|
||||
--cloud-provider="": The provider for cloud services. Empty string for no provider.
|
||||
--cluster-name="kubernetes": The instance prefix for the cluster
|
||||
--cors-allowed-origins=[]: List of allowed origins for CORS, comma separated. An allowed origin can be a regular expression to support subdomain matching. If this list is empty CORS will not be enabled.
|
||||
--etcd-config="": The config file for the etcd client. Mutually exclusive with -etcd-servers.
|
||||
--etcd-prefix="/registry": The prefix for all resource paths in etcd.
|
||||
--etcd-servers=[]: List of etcd servers to watch (http://ip:port), comma separated. Mutually exclusive with -etcd-config
|
||||
--etcd-servers-overrides=[]: Per-resource etcd servers overrides, comma separated. The individual override format: group/resource#servers, where servers are http://ip:port, semicolon separated.
|
||||
--event-ttl=1h0m0s: Amount of time to retain events. Default 1 hour.
|
||||
--experimental-keystone-url="": If passed, activates the keystone authentication plugin
|
||||
--external-hostname="": The hostname to use when generating externalized URLs for this master (e.g. Swagger API Docs.)
|
||||
--google-json-key="": The Google Cloud Platform Service Account JSON Key to use for authentication.
|
||||
--insecure-bind-address=127.0.0.1: The IP address on which to serve the --insecure-port (set to 0.0.0.0 for all interfaces). Defaults to localhost.
|
||||
--insecure-port=8080: The port on which to serve unsecured, unauthenticated access. Default 8080. It is assumed that firewall rules are set up such that this port is not reachable from outside of the cluster and that port 443 on the cluster's public address is proxied to this port. This is performed by nginx in the default setup.
|
||||
--kubelet-certificate-authority="": Path to a cert. file for the certificate authority.
|
||||
--kubelet-client-certificate="": Path to a client cert file for TLS.
|
||||
--kubelet-client-key="": Path to a client key file for TLS.
|
||||
--kubelet-https[=true]: Use https for kubelet connections
|
||||
--kubelet-port=10250: Kubelet port
|
||||
--kubelet-timeout=5s: Timeout for kubelet operations
|
||||
--log-flush-frequency=5s: Maximum number of seconds between log flushes
|
||||
--long-running-request-regexp="(/|^)((watch|proxy)(/|$)|(logs?|portforward|exec|attach)/?$)": A regular expression matching long running requests which should be excluded from maximum inflight request handling.
|
||||
--master-service-namespace="default": The namespace from which the kubernetes master services should be injected into pods
|
||||
--max-connection-bytes-per-sec=0: If non-zero, throttle each user connection to this number of bytes/sec. Currently only applies to long-running requests
|
||||
--max-requests-inflight=400: The maximum number of requests in flight at a given time. When the server exceeds this, it rejects requests. Zero for no limit.
|
||||
--min-request-timeout=1800: An optional field indicating the minimum number of seconds a handler must keep a request open before timing it out. Currently only honored by the watch request handler, which picks a randomized value above this number as the connection timeout, to spread out load.
|
||||
--oidc-ca-file="": If set, the OpenID server's certificate will be verified by one of the authorities in the oidc-ca-file, otherwise the host's root CA set will be used
|
||||
--oidc-client-id="": The client ID for the OpenID Connect client, must be set if oidc-issuer-url is set
|
||||
--oidc-issuer-url="": The URL of the OpenID issuer, only HTTPS scheme will be accepted. If set, it will be used to verify the OIDC JSON Web Token (JWT)
|
||||
--oidc-username-claim="sub": The OpenID claim to use as the user name. Note that claims other than the default ('sub') is not guaranteed to be unique and immutable. This flag is experimental, please see the authentication documentation for further details.
|
||||
--profiling[=true]: Enable profiling via web interface host:port/debug/pprof/
|
||||
--runtime-config=: A set of key=value pairs that describe runtime configuration that may be passed to apiserver. apis/<groupVersion> key can be used to turn on/off specific api versions. apis/<groupVersion>/<resource> can be used to turn on/off specific resources. api/all and api/legacy are special keys to control all and legacy api versions respectively.
|
||||
--secure-port=6443: The port on which to serve HTTPS with authentication and authorization. If 0, don't serve HTTPS at all.
|
||||
--service-account-key-file="": File containing PEM-encoded x509 RSA private or public key, used to verify ServiceAccount tokens. If unspecified, --tls-private-key-file is used.
|
||||
--service-account-lookup[=false]: If true, validate ServiceAccount tokens exist in etcd as part of authentication.
|
||||
--service-cluster-ip-range=<nil>: A CIDR notation IP range from which to assign service cluster IPs. This must not overlap with any IP ranges assigned to nodes for pods.
|
||||
--service-node-port-range=: A port range to reserve for services with NodePort visibility. Example: '30000-32767'. Inclusive at both ends of the range.
|
||||
--ssh-keyfile="": If non-empty, use secure SSH proxy to the nodes, using this user keyfile
|
||||
--ssh-user="": If non-empty, use secure SSH proxy to the nodes, using this user name
|
||||
--storage-versions="extensions/v1beta1,v1": The versions to store resources with. Different groups may be stored in different versions. Specified in the format "group1/version1,group2/version2...". This flag expects a complete list of storage versions of ALL groups registered in the server. It defaults to a list of preferred versions of all registered groups, which is derived from the KUBE_API_VERSIONS environment variable.
|
||||
--tls-cert-file="": File containing x509 Certificate for HTTPS. (CA cert, if any, concatenated after server cert). If HTTPS serving is enabled, and --tls-cert-file and --tls-private-key-file are not provided, a self-signed certificate and key are generated for the public address and saved to /var/run/kubernetes.
|
||||
--tls-private-key-file="": File containing x509 private key matching --tls-cert-file.
|
||||
--token-auth-file="": If set, the file that will be used to secure the secure port of the API server via token authentication.
|
||||
--watch-cache[=true]: Enable watch caching in the apiserver
|
||||
{% endraw %}
|
||||
```
|
||||
|
||||
###### Auto generated by spf13/cobra at 2015-10-29 20:12:33.554980405 +0000 UTC
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/admin/kube-apiserver.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,89 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "kube-controller-manager"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
## kube-controller-manager
|
||||
|
||||
|
||||
|
||||
### Synopsis
|
||||
|
||||
|
||||
The Kubernetes controller manager is a daemon that embeds
|
||||
the core control loops shipped with Kubernetes. In applications of robotics and
|
||||
automation, a control loop is a non-terminating loop that regulates the state of
|
||||
the system. In Kubernetes, a controller is a control loop that watches the shared
|
||||
state of the cluster through the apiserver and makes changes attempting to move the
|
||||
current state towards the desired state. Examples of controllers that ship with
|
||||
Kubernetes today are the replication controller, endpoints controller, namespace
|
||||
controller, and serviceaccounts controller.
|
||||
|
||||
```
|
||||
{% raw %}
|
||||
kube-controller-manager
|
||||
{% endraw %}
|
||||
```
|
||||
|
||||
### Options
|
||||
|
||||
```
|
||||
{% raw %}
|
||||
--address=127.0.0.1: The IP address to serve on (set to 0.0.0.0 for all interfaces)
|
||||
--allocate-node-cidrs[=false]: Should CIDRs for Pods be allocated and set on the cloud provider.
|
||||
--cloud-config="": The path to the cloud provider configuration file. Empty string for no configuration file.
|
||||
--cloud-provider="": The provider for cloud services. Empty string for no provider.
|
||||
--cluster-cidr=<nil>: CIDR Range for Pods in cluster.
|
||||
--cluster-name="kubernetes": The instance prefix for the cluster
|
||||
--concurrent-endpoint-syncs=5: The number of endpoint syncing operations that will be done concurrently. Larger number = faster endpoint updating, but more CPU (and network) load
|
||||
--concurrent_rc_syncs=5: The number of replication controllers that are allowed to sync concurrently. Larger number = more reponsive replica management, but more CPU (and network) load
|
||||
--deleting-pods-burst=10: Number of nodes on which pods are bursty deleted in case of node failure. For more details look into RateLimiter.
|
||||
--deleting-pods-qps=0.1: Number of nodes per second on which pods are deleted in case of node failure.
|
||||
--deployment-controller-sync-period=30s: Period for syncing the deployments.
|
||||
--google-json-key="": The Google Cloud Platform Service Account JSON Key to use for authentication.
|
||||
--horizontal-pod-autoscaler-sync-period=30s: The period for syncing the number of pods in horizontal pod autoscaler.
|
||||
--kubeconfig="": Path to kubeconfig file with authorization and master location information.
|
||||
--log-flush-frequency=5s: Maximum number of seconds between log flushes
|
||||
--master="": The address of the Kubernetes API server (overrides any value in kubeconfig)
|
||||
--min-resync-period=12h0m0s: The resync period in reflectors will be random between MinResyncPeriod and 2*MinResyncPeriod
|
||||
--namespace-sync-period=5m0s: The period for syncing namespace life-cycle updates
|
||||
--node-monitor-grace-period=40s: Amount of time which we allow running Node to be unresponsive before marking it unhealty. Must be N times more than kubelet's nodeStatusUpdateFrequency, where N means number of retries allowed for kubelet to post node status.
|
||||
--node-monitor-period=5s: The period for syncing NodeStatus in NodeController.
|
||||
--node-startup-grace-period=1m0s: Amount of time which we allow starting Node to be unresponsive before marking it unhealty.
|
||||
--node-sync-period=10s: The period for syncing nodes from cloudprovider. Longer periods will result in fewer calls to cloud provider, but may delay addition of new nodes to cluster.
|
||||
--pod-eviction-timeout=5m0s: The grace period for deleting pods on failed nodes.
|
||||
--port=10252: The port that the controller-manager's http service runs on
|
||||
--profiling[=true]: Enable profiling via web interface host:port/debug/pprof/
|
||||
--pv-recycler-increment-timeout-nfs=30: the increment of time added per Gi to ActiveDeadlineSeconds for an NFS scrubber pod
|
||||
--pv-recycler-minimum-timeout-hostpath=60: The minimum ActiveDeadlineSeconds to use for a HostPath Recycler pod. This is for development and testing only and will not work in a multi-node cluster.
|
||||
--pv-recycler-minimum-timeout-nfs=300: The minimum ActiveDeadlineSeconds to use for an NFS Recycler pod
|
||||
--pv-recycler-pod-template-filepath-hostpath="": The file path to a pod definition used as a template for HostPath persistent volume recycling. This is for development and testing only and will not work in a multi-node cluster.
|
||||
--pv-recycler-pod-template-filepath-nfs="": The file path to a pod definition used as a template for NFS persistent volume recycling
|
||||
--pv-recycler-timeout-increment-hostpath=30: the increment of time added per Gi to ActiveDeadlineSeconds for a HostPath scrubber pod. This is for development and testing only and will not work in a multi-node cluster.
|
||||
--pvclaimbinder-sync-period=10s: The period for syncing persistent volumes and persistent volume claims
|
||||
--resource-quota-sync-period=10s: The period for syncing quota usage status in the system
|
||||
--root-ca-file="": If set, this root certificate authority will be included in service account's token secret. This must be a valid PEM-encoded CA bundle.
|
||||
--service-account-private-key-file="": Filename containing a PEM-encoded private RSA key used to sign service account tokens.
|
||||
--service-sync-period=5m0s: The period for syncing services with their external load balancers
|
||||
--terminated-pod-gc-threshold=12500: Number of terminated pods that can exist before the terminated pod garbage collector starts deleting terminated pods. If <= 0, the terminated pod garbage collector is disabled.
|
||||
{% endraw %}
|
||||
```
|
||||
|
||||
###### Auto generated by spf13/cobra at 2015-10-29 20:12:25.539938496 +0000 UTC
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/admin/kube-controller-manager.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,67 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "kube-proxy"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
## kube-proxy
|
||||
|
||||
|
||||
|
||||
### Synopsis
|
||||
|
||||
|
||||
The Kubernetes network proxy runs on each node. This
|
||||
reflects services as defined in the Kubernetes API on each node and can do simple
|
||||
TCP,UDP stream forwarding or round robin TCP,UDP forwarding across a set of backends.
|
||||
Service cluster ips and ports are currently found through Docker-links-compatible
|
||||
environment variables specifying ports opened by the service proxy. There is an optional
|
||||
addon that provides cluster DNS for these cluster IPs. The user must create a service
|
||||
with the apiserver API to configure the proxy.
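
A minimal sketch of such a service, assuming backend pods labelled `app: my-app` (the names and ports
below are illustrative, not part of kube-proxy itself):

```
{% raw %}
apiVersion: v1
kind: Service
metadata:
  name: my-service          # illustrative name; the proxy programs forwarding rules for each service
spec:
  selector:
    app: my-app             # pods matching this label become the service's backends
  ports:
  - protocol: TCP
    port: 80                # port exposed on the service's cluster IP
    targetPort: 8080        # port the backend containers listen on
{% endraw %}
```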
|
||||
|
||||
```
|
||||
{% raw %}
|
||||
kube-proxy
|
||||
{% endraw %}
|
||||
```
|
||||
|
||||
### Options
|
||||
|
||||
```
|
||||
{% raw %}
|
||||
--bind-address=0.0.0.0: The IP address for the proxy server to serve on (set to 0.0.0.0 for all interfaces)
|
||||
--cleanup-iptables[=false]: If true cleanup iptables rules and exit.
|
||||
--google-json-key="": The Google Cloud Platform Service Account JSON Key to use for authentication.
|
||||
--healthz-bind-address=127.0.0.1: The IP address for the health check server to serve on, defaulting to 127.0.0.1 (set to 0.0.0.0 for all interfaces)
|
||||
--healthz-port=10249: The port to bind the health check server. Use 0 to disable.
|
||||
--hostname-override="": If non-empty, will use this string as identification instead of the actual hostname.
|
||||
--iptables-sync-period=30s: How often iptables rules are refreshed (e.g. '5s', '1m', '2h22m'). Must be greater than 0.
|
||||
--kubeconfig="": Path to kubeconfig file with authorization information (the master location is set by the master flag).
|
||||
--log-flush-frequency=5s: Maximum number of seconds between log flushes
|
||||
--masquerade-all[=false]: If using the pure iptables proxy, SNAT everything
|
||||
--master="": The address of the Kubernetes API server (overrides any value in kubeconfig)
|
||||
--oom-score-adj=-999: The oom-score-adj value for kube-proxy process. Values must be within the range [-1000, 1000]
|
||||
--proxy-mode="": Which proxy mode to use: 'userspace' (older, stable) or 'iptables' (experimental). If blank, look at the Node object on the Kubernetes API and respect the 'net.experimental.kubernetes.io/proxy-mode' annotation if provided. Otherwise use the best-available proxy (currently userspace, but may change in future versions). If the iptables proxy is selected, regardless of how, but the system's kernel or iptables versions are insufficient, this always falls back to the userspace proxy.
|
||||
--proxy-port-range=: Range of host ports (beginPort-endPort, inclusive) that may be consumed in order to proxy service traffic. If unspecified (0-0) then ports will be randomly chosen.
|
||||
--resource-container="/kube-proxy": Absolute name of the resource-only container to create and run the Kube-proxy in (Default: /kube-proxy).
|
||||
--udp-timeout=250ms: How long an idle UDP connection will be kept open (e.g. '250ms', '2s'). Must be greater than 0. Only applicable for proxy-mode=userspace
|
||||
{% endraw %}
|
||||
```
|
||||
|
||||
###### Auto generated by spf13/cobra at 2015-10-29 20:12:28.465584706 +0000 UTC
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/admin/kube-proxy.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,62 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "kube-scheduler"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
## kube-scheduler
|
||||
|
||||
|
||||
|
||||
### Synopsis
|
||||
|
||||
|
||||
The Kubernetes scheduler is a policy-rich, topology-aware,
|
||||
workload-specific function that significantly impacts availability, performance,
|
||||
and capacity. The scheduler needs to take into account individual and collective
|
||||
resource requirements, quality of service requirements, hardware/software/policy
|
||||
constraints, affinity and anti-affinity specifications, data locality, inter-workload
|
||||
interference, deadlines, and so on. Workload-specific requirements will be exposed
|
||||
through the API as necessary.
|
||||
|
||||
```
|
||||
{% raw %}
|
||||
kube-scheduler
|
||||
{% endraw %}
|
||||
```
|
||||
|
||||
### Options
|
||||
|
||||
```
|
||||
{% raw %}
|
||||
--address=127.0.0.1: The IP address to serve on (set to 0.0.0.0 for all interfaces)
|
||||
--algorithm-provider="DefaultProvider": The scheduling algorithm provider to use, one of: DefaultProvider
|
||||
--bind-pods-burst=100: Number of bindings per second scheduler is allowed to make during bursts
|
||||
--bind-pods-qps=50: Number of bindings per second scheduler is allowed to continuously make
|
||||
--google-json-key="": The Google Cloud Platform Service Account JSON Key to use for authentication.
|
||||
--kubeconfig="": Path to kubeconfig file with authorization and master location information.
|
||||
--log-flush-frequency=5s: Maximum number of seconds between log flushes
|
||||
--master="": The address of the Kubernetes API server (overrides any value in kubeconfig)
|
||||
--policy-config-file="": File with scheduler policy configuration
|
||||
--port=10251: The port that the scheduler's http service runs on
|
||||
--profiling[=true]: Enable profiling via web interface host:port/debug/pprof/
|
||||
{% endraw %}
|
||||
```
|
||||
|
||||
###### Auto generated by spf13/cobra at 2015-10-29 20:12:20.542446971 +0000 UTC
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/admin/kube-scheduler.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,129 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "kubelet"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
## kubelet
|
||||
|
||||
|
||||
|
||||
### Synopsis
|
||||
|
||||
|
||||
The kubelet is the primary "node agent" that runs on each
|
||||
node. The kubelet works in terms of a PodSpec. A PodSpec is a YAML or JSON object
|
||||
that describes a pod. The kubelet takes a set of PodSpecs that are provided through
|
||||
various mechanisms (primarily through the apiserver) and ensures that the containers
|
||||
described in those PodSpecs are running and healthy.
|
||||
|
||||
Other than from a PodSpec from the apiserver, there are three ways that a container
|
||||
manifest can be provided to the Kubelet.
|
||||
|
||||
File: Path passed as a flag on the command line. This file is rechecked every 20
|
||||
seconds (configurable with a flag).
|
||||
|
||||
HTTP endpoint: HTTP endpoint passed as a parameter on the command line. This endpoint
|
||||
is checked every 20 seconds (also configurable with a flag).
|
||||
|
||||
HTTP server: The kubelet can also listen for HTTP and respond to a simple API
|
||||
(underspec'd currently) to submit a new manifest.
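
For example, a minimal PodSpec that could be dropped into the directory passed via `--config` (the
names and image below are illustrative):

```
{% raw %}
apiVersion: v1
kind: Pod
metadata:
  name: static-web          # illustrative name
spec:
  containers:
  - name: web
    image: nginx            # any image the node can pull
    ports:
    - containerPort: 80
{% endraw %}
```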
|
||||
|
||||
```
|
||||
{% raw %}
|
||||
kubelet
|
||||
{% endraw %}
|
||||
```
|
||||
|
||||
### Options
|
||||
|
||||
```
|
||||
{% raw %}
|
||||
--address=0.0.0.0: The IP address for the Kubelet to serve on (set to 0.0.0.0 for all interfaces)
|
||||
--allow-privileged[=false]: If true, allow containers to request privileged mode. [default=false]
|
||||
--api-servers=[]: List of Kubernetes API servers for publishing events, and reading pods and services. (ip:port), comma separated.
|
||||
--cadvisor-port=4194: The port of the localhost cAdvisor endpoint
|
||||
--cert-dir="/var/run/kubernetes": The directory where the TLS certs are located (by default /var/run/kubernetes). If --tls-cert-file and --tls-private-key-file are provided, this flag will be ignored.
|
||||
--cgroup-root="": Optional root cgroup to use for pods. This is handled by the container runtime on a best effort basis. Default: '', which means use the container runtime default.
|
||||
--chaos-chance=0: If > 0.0, introduce random client errors and latency. Intended for testing. [default=0.0]
|
||||
--cloud-config="": The path to the cloud provider configuration file. Empty string for no configuration file.
|
||||
--cloud-provider="": The provider for cloud services. Empty string for no provider.
|
||||
--cluster-dns=<nil>: IP address for a cluster DNS server. If set, kubelet will configure all containers to use this for DNS resolution in addition to the host's DNS servers
|
||||
--cluster-domain="": Domain for this cluster. If set, kubelet will configure all containers to search this domain in addition to the host's search domains
|
||||
--config="": Path to the config file or directory of files
|
||||
--configure-cbr0[=false]: If true, kubelet will configure cbr0 based on Node.Spec.PodCIDR.
|
||||
--container-runtime="docker": The container runtime to use. Possible values: 'docker', 'rkt'. Default: 'docker'.
|
||||
--containerized[=false]: Experimental support for running kubelet in a container. Intended for testing. [default=false]
|
||||
--cpu-cfs-quota[=false]: Enable CPU CFS quota enforcement for containers that specify CPU limits
|
||||
--docker-endpoint="": If non-empty, use this for the docker endpoint to communicate with
|
||||
--docker-exec-handler="native": Handler to use when executing a command in a container. Valid values are 'native' and 'nsenter'. Defaults to 'native'.
|
||||
--enable-debugging-handlers[=true]: Enables server endpoints for log collection and local running of containers and commands
|
||||
--enable-server[=true]: Enable the Kubelet's server
|
||||
--event-burst=0: Maximum size of a bursty event records, temporarily allows event records to burst to this number, while still not exceeding event-qps. Only used if --event-qps > 0
|
||||
--event-qps=0: If > 0, limit event creations per second to this value. If 0, unlimited. [default=0.0]
|
||||
--file-check-frequency=20s: Duration between checking config files for new data
|
||||
--google-json-key="": The Google Cloud Platform Service Account JSON Key to use for authentication.
|
||||
--healthz-bind-address=127.0.0.1: The IP address for the healthz server to serve on, defaulting to 127.0.0.1 (set to 0.0.0.0 for all interfaces)
|
||||
--healthz-port=10248: The port of the localhost healthz endpoint
|
||||
--host-ipc-sources="*": Comma-separated list of sources from which the Kubelet allows pods to use the host ipc namespace. [default="*"]
|
||||
--host-network-sources="*": Comma-separated list of sources from which the Kubelet allows pods to use of host network. [default="*"]
|
||||
--host-pid-sources="*": Comma-separated list of sources from which the Kubelet allows pods to use the host pid namespace. [default="*"]
|
||||
--hostname-override="": If non-empty, will use this string as identification instead of the actual hostname.
|
||||
--http-check-frequency=20s: Duration between checking http for new data
|
||||
--image-gc-high-threshold=90: The percent of disk usage after which image garbage collection is always run. Default: 90%%
|
||||
--image-gc-low-threshold=80: The percent of disk usage before which image garbage collection is never run. Lowest disk usage to garbage collect to. Default: 80%%
|
||||
--kubeconfig="/var/lib/kubelet/kubeconfig": Path to a kubeconfig file, specifying how to authenticate to API server (the master location is set by the api-servers flag).
|
||||
--log-flush-frequency=5s: Maximum number of seconds between log flushes
|
||||
--low-diskspace-threshold-mb=256: The absolute free disk space, in MB, to maintain. When disk space falls below this threshold, new pods would be rejected. Default: 256
|
||||
--manifest-url="": URL for accessing the container manifest
|
||||
--manifest-url-header="": HTTP header to use when accessing the manifest URL, with the key separated from the value with a ':', as in 'key:value'
|
||||
--master-service-namespace="default": The namespace from which the kubernetes master services should be injected into pods
|
||||
--max-open-files=1000000: Number of files that can be opened by Kubelet process. [default=1000000]
|
||||
--max-pods=40: Number of Pods that can run on this Kubelet.
|
||||
--maximum-dead-containers=100: Maximum number of old instances of containers to retain globally. Each container takes up some disk space. Default: 100.
|
||||
--maximum-dead-containers-per-container=2: Maximum number of old instances to retain per container. Each container takes up some disk space. Default: 2.
|
||||
--minimum-container-ttl-duration=1m0s: Minimum age for a finished container before it is garbage collected. Examples: '300ms', '10s' or '2h45m'
|
||||
--network-plugin="": <Warning: Alpha feature> The name of the network plugin to be invoked for various events in kubelet/pod lifecycle
|
||||
--network-plugin-dir="/usr/libexec/kubernetes/kubelet-plugins/net/exec/": <Warning: Alpha feature> The full path of the directory in which to search for network plugins
|
||||
--node-status-update-frequency=10s: Specifies how often kubelet posts node status to master. Note: be cautious when changing the constant, it must work with nodeMonitorGracePeriod in nodecontroller. Default: 10s
|
||||
--oom-score-adj=-999: The oom-score-adj value for kubelet process. Values must be within the range [-1000, 1000]
|
||||
--pod-cidr="": The CIDR to use for pod IP addresses, only used in standalone mode. In cluster mode, this is obtained from the master.
|
||||
--pod-infra-container-image="gcr.io/google_containers/pause:0.8.0": The image whose network/ipc namespaces containers in each pod will use.
|
||||
--port=10250: The port for the Kubelet to serve on. Note that "kubectl logs" will not work if you set this flag.
|
||||
--read-only-port=10255: The read-only port for the Kubelet to serve on (set to 0 to disable)
|
||||
--really-crash-for-testing[=false]: If true, when panics occur crash. Intended for testing.
|
||||
--register-node[=true]: Register the node with the apiserver (defaults to true if --api-servers is set)
|
||||
--registry-burst=10: Maximum size of a bursty pulls, temporarily allows pulls to burst to this number, while still not exceeding registry-qps. Only used if --registry-qps > 0
|
||||
--registry-qps=0: If > 0, limit registry pull QPS to this value. If 0, unlimited. [default=0.0]
|
||||
--resolv-conf="/etc/resolv.conf": Resolver configuration file used as the basis for the container DNS resolution configuration.
|
||||
--resource-container="/kubelet": Absolute name of the resource-only container to create and run the Kubelet in (Default: /kubelet).
|
||||
--rkt-path="": Path of rkt binary. Leave empty to use the first rkt in $PATH. Only used if --container-runtime='rkt'
|
||||
--rkt-stage1-image="": image to use as stage1. Local paths and http/https URLs are supported. If empty, the 'stage1.aci' in the same directory as '--rkt-path' will be used
|
||||
--root-dir="/var/lib/kubelet": Directory path for managing kubelet files (volume mounts,etc).
|
||||
--runonce[=false]: If true, exit after spawning pods from local manifests or remote urls. Exclusive with --api-servers, and --enable-server
|
||||
--serialize-image-pulls[=true]: Pull images one at a time. We recommend *not* changing the default value on nodes that run docker daemon with version < 1.9 or an Aufs storage backend. Issue #10959 has more details. [default=true]
|
||||
--streaming-connection-idle-timeout=0: Maximum time a streaming connection can be idle before the connection is automatically closed. Example: '5m'
|
||||
--sync-frequency=10s: Max period between synchronizing running containers and config
|
||||
--system-container="": Optional resource-only container in which to place all non-kernel processes that are not already in a container. Empty for no container. Rolling back the flag requires a reboot. (Default: "").
|
||||
--tls-cert-file="": File containing x509 Certificate for HTTPS. (CA cert, if any, concatenated after server cert). If --tls-cert-file and --tls-private-key-file are not provided, a self-signed certificate and key are generated for the public address and saved to the directory passed to --cert-dir.
|
||||
--tls-private-key-file="": File containing x509 private key matching --tls-cert-file.
|
||||
{% endraw %}
|
||||
```
|
||||
|
||||
###### Auto generated by spf13/cobra at 2015-10-29 20:12:15.480131233 +0000 UTC
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/admin/kubelet.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,236 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "Limit Range"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
Limit Range
|
||||
========================================
|
||||
By default, pods run with unbounded CPU and memory limits. This means that any pod in the
|
||||
system will be able to consume as much CPU and memory as is available on the node that executes the pod.
|
||||
|
||||
Users may want to impose restrictions on the amount of resource a single pod in the system may consume
|
||||
for a variety of reasons.
|
||||
|
||||
For example:
|
||||
|
||||
1. Each node in the cluster has 2GB of memory. The cluster operator does not want to accept pods
|
||||
that require more than 2GB of memory since no node in the cluster can support the requirement. To prevent a
|
||||
pod from being permanently unscheduled to a node, the operator instead chooses to reject pods that exceed 2GB
|
||||
of memory as part of admission control.
|
||||
2. A cluster is shared by two communities in an organization that run production and development workloads,
|
||||
respectively. Production workloads may consume up to 8GB of memory, but development workloads may consume up
|
||||
to 512MB of memory. The cluster operator creates a separate namespace for each workload, and applies limits to
|
||||
each namespace.
|
||||
3. Users may create a pod which consumes resources just below the capacity of a machine. The leftover space
|
||||
may be too small to be useful, but big enough for the waste to be costly over the entire cluster. As a result,
|
||||
the cluster operator may want to require that a pod consume at least 20% of the memory and CPU of the
|
||||
average node size in order to provide for more uniform scheduling and to limit waste.
|
||||
|
||||
This example demonstrates how limits can be applied to a Kubernetes namespace to control
|
||||
min/max resource limits per pod. In addition, this example demonstrates how you can
|
||||
apply default resource limits to pods in the absence of an end-user specified value.
|
||||
|
||||
See [LimitRange design doc](../../design/admission_control_limit_range.html) for more information. For a detailed description of the Kubernetes resource model, see [Resources](../../../docs/user-guide/compute-resources.html)
|
||||
|
||||
Step 0: Prerequisites
|
||||
-----------------------------------------
|
||||
This example requires a running Kubernetes cluster. See the [Getting Started guides](../../../docs/getting-started-guides/) for how to get started.
|
||||
|
||||
Change to the `<kubernetes>` directory if you're not already there.
|
||||
|
||||
Step 1: Create a namespace
|
||||
-----------------------------------------
|
||||
This example will work in a custom namespace to demonstrate the concepts involved.
|
||||
|
||||
Let's create a new namespace called limit-example:
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl create -f docs/admin/limitrange/namespace.yaml
|
||||
namespace "limit-example" created
|
||||
$ kubectl get namespaces
|
||||
NAME LABELS STATUS AGE
|
||||
default <none> Active 5m
|
||||
limit-example <none> Active 53s
|
||||
{% endraw %}
|
||||
{% endhighlight %}
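
The `namespace.yaml` file referenced above is not reproduced here; a minimal manifest along these
lines creates the namespace:

{% highlight yaml %}
{% raw %}
apiVersion: v1
kind: Namespace
metadata:
  name: limit-example
{% endraw %}
{% endhighlight %}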
|
||||
|
||||
Step 2: Apply a limit to the namespace
|
||||
-----------------------------------------
|
||||
Let's create a simple limit in our namespace.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl create -f docs/admin/limitrange/limits.yaml --namespace=limit-example
|
||||
limitrange "mylimits" created
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
Let's describe the limits that we have imposed in our namespace.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl describe limits mylimits --namespace=limit-example
|
||||
Name: mylimits
|
||||
Namespace: limit-example
|
||||
Type Resource Min Max Request Limit Limit/Request
|
||||
---- -------- --- --- ------- ----- -------------
|
||||
Pod cpu 200m 2 - - -
|
||||
Pod memory 6Mi 1Gi - - -
|
||||
Container cpu 100m 2 200m 300m -
|
||||
Container memory 3Mi 1Gi 100Mi 200Mi -
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
In this scenario, we have said the following:
|
||||
|
||||
1. If a max constraint is specified for a resource (2 CPU and 1Gi memory in this case), then a limit
|
||||
must be specified for that resource across all containers. Failure to specify a limit will result in
|
||||
a validation error when attempting to create the pod. Note that a default value of limit is set by
|
||||
*default* in file `limits.yaml` (300m CPU and 200Mi memory).
|
||||
2. If a min constraint is specified for a resource (100m CPU and 3Mi memory in this case), then a
|
||||
request must be specified for that resource across all containers. Failure to specify a request will
|
||||
result in a validation error when attempting to create the pod. Note that a default value of request is
|
||||
set by *defaultRequest* in file `limits.yaml` (200m CPU and 100Mi memory).
|
||||
3. For any pod, the sum of all containers' memory requests must be >= 6Mi and the sum of all containers'
|
||||
memory limits must be <= 1Gi; the sum of all containers' CPU requests must be >= 200m and the sum of all
|
||||
containers' CPU limits must be <= 2.
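
For reference, a `LimitRange` consistent with the values above would look roughly like the sketch
below (reconstructed from the `kubectl describe` output, so it may differ cosmetically from the
`limits.yaml` used in this example):

{% highlight yaml %}
{% raw %}
apiVersion: v1
kind: LimitRange
metadata:
  name: mylimits
spec:
  limits:
  - type: Pod                 # constraints on each pod as a whole
    min:
      cpu: 200m
      memory: 6Mi
    max:
      cpu: "2"
      memory: 1Gi
  - type: Container           # constraints and defaults for each container
    min:
      cpu: 100m
      memory: 3Mi
    max:
      cpu: "2"
      memory: 1Gi
    default:                  # default limits applied when a container specifies none
      cpu: 300m
      memory: 200Mi
    defaultRequest:           # default requests applied when a container specifies none
      cpu: 200m
      memory: 100Mi
{% endraw %}
{% endhighlight %}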
|
||||
|
||||
Step 3: Enforcing limits at point of creation
|
||||
-----------------------------------------
|
||||
The limits enumerated in a namespace are only enforced when a pod is created or updated in
|
||||
the cluster. If you change the limits to a different value range, it does not affect pods that
|
||||
were previously created in a namespace.
|
||||
|
||||
If a resource (cpu or memory) is being restricted by a limit, the user will get an error at time
|
||||
of creation explaining why.
|
||||
|
||||
Let's first spin up a replication controller that creates a single container pod to demonstrate
|
||||
how default values are applied to each pod.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl run nginx --image=nginx --replicas=1 --namespace=limit-example
|
||||
replicationcontroller "nginx" created
|
||||
$ kubectl get pods --namespace=limit-example
|
||||
NAME READY STATUS RESTARTS AGE
|
||||
nginx-aq0mf 1/1 Running 0 35s
|
||||
$ kubectl get pods nginx-aq0mf --namespace=limit-example -o yaml | grep resources -C 8
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
{% highlight yaml %}
|
||||
{% raw %}
|
||||
resourceVersion: "127"
|
||||
selfLink: /api/v1/namespaces/limit-example/pods/nginx-aq0mf
|
||||
uid: 51be42a7-7156-11e5-9921-286ed488f785
|
||||
spec:
|
||||
containers:
|
||||
- image: nginx
|
||||
imagePullPolicy: IfNotPresent
|
||||
name: nginx
|
||||
resources:
|
||||
limits:
|
||||
cpu: 300m
|
||||
memory: 200Mi
|
||||
requests:
|
||||
cpu: 200m
|
||||
memory: 100Mi
|
||||
terminationMessagePath: /dev/termination-log
|
||||
volumeMounts:
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
Note that our nginx container has picked up the namespace default cpu and memory resource *limits* and *requests*.
|
||||
|
||||
Let's create a pod that exceeds our allowed limits by giving it a container that requests 3 CPU cores.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl create -f docs/admin/limitrange/invalid-pod.yaml --namespace=limit-example
|
||||
Error from server: error when creating "docs/admin/limitrange/invalid-pod.yaml": Pod "invalid-pod" is forbidden: [Maximum cpu usage per Pod is 2, but limit is 3., Maximum cpu usage per Container is 2, but limit is 3.]
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
Let's create a pod that falls within the allowed limit boundaries.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl create -f docs/admin/limitrange/valid-pod.yaml --namespace=limit-example
|
||||
pod "valid-pod" created
|
||||
$ kubectl get pods valid-pod --namespace=limit-example -o yaml | grep -C 6 resources
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
{% highlight yaml %}
|
||||
{% raw %}
|
||||
uid: 162a12aa-7157-11e5-9921-286ed488f785
|
||||
spec:
|
||||
containers:
|
||||
- image: gcr.io/google_containers/serve_hostname
|
||||
imagePullPolicy: IfNotPresent
|
||||
name: kubernetes-serve-hostname
|
||||
resources:
|
||||
limits:
|
||||
cpu: "1"
|
||||
memory: 512Mi
|
||||
requests:
|
||||
cpu: "1"
|
||||
memory: 512Mi
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
Note that this pod specifies explicit resource *limits* and *requests* so it did not pick up the namespace
|
||||
default values.
|
||||
|
||||
Note: The *limits* for CPU resource are not enforced in the default Kubernetes setup on the physical node
|
||||
that runs the container unless the administrator deploys the kubelet with the following flag:
|
||||
|
||||
```
|
||||
{% raw %}
|
||||
$ kubelet --help
|
||||
Usage of kubelet
|
||||
....
|
||||
--cpu-cfs-quota[=false]: Enable CPU CFS quota enforcement for containers that specify CPU limits
|
||||
$ kubelet --cpu-cfs-quota=true ...
|
||||
{% endraw %}
|
||||
```
|
||||
|
||||
Step 4: Cleanup
|
||||
----------------------------
|
||||
To remove the resources used by this example, you can just delete the limit-example namespace.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl delete namespace limit-example
|
||||
namespace "limit-example" deleted
|
||||
$ kubectl get namespaces
|
||||
NAME LABELS STATUS AGE
|
||||
default <none> Active 20m
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
Summary
|
||||
----------------------------
|
||||
Cluster operators that want to restrict the amount of resources a single container or pod may consume
|
||||
are able to define allowable ranges per Kubernetes namespace. In the absence of any explicit assignments,
|
||||
the Kubernetes system is able to apply default resource *limits* and *requests* if desired in order to
|
||||
constrain the amount of resource a pod consumes on a node.
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/admin/limitrange/README.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,236 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "Limit Range"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
Limit Range
|
||||
========================================
|
||||
By default, pods run with unbounded CPU and memory limits. This means that any pod in the
|
||||
system will be able to consume as much CPU and memory as is available on the node that executes the pod.
|
||||
|
||||
Users may want to impose restrictions on the amount of resource a single pod in the system may consume
|
||||
for a variety of reasons.
|
||||
|
||||
For example:
|
||||
|
||||
1. Each node in the cluster has 2GB of memory. The cluster operator does not want to accept pods
|
||||
that require more than 2GB of memory since no node in the cluster can support the requirement. To prevent a
|
||||
pod from being permanently unscheduled to a node, the operator instead chooses to reject pods that exceed 2GB
|
||||
of memory as part of admission control.
|
||||
2. A cluster is shared by two communities in an organization that run production and development workloads,
|
||||
respectively. Production workloads may consume up to 8GB of memory, but development workloads may consume up
|
||||
to 512MB of memory. The cluster operator creates a separate namespace for each workload, and applies limits to
|
||||
each namespace.
|
||||
3. Users may create a pod which consumes resources just below the capacity of a machine. The leftover space
|
||||
may be too small to be useful, but big enough for the waste to be costly over the entire cluster. As a result,
|
||||
the cluster operator may want to require that a pod consume at least 20% of the memory and CPU of the
|
||||
average node size in order to provide for more uniform scheduling and to limit waste.
|
||||
|
||||
This example demonstrates how limits can be applied to a Kubernetes namespace to control
|
||||
min/max resource limits per pod. In addition, this example demonstrates how you can
|
||||
apply default resource limits to pods in the absence of an end-user specified value.
|
||||
|
||||
See [LimitRange design doc](../../design/admission_control_limit_range.html) for more information. For a detailed description of the Kubernetes resource model, see [Resources](../../../docs/user-guide/compute-resources.html)
|
||||
|
||||
Step 0: Prerequisites
|
||||
-----------------------------------------
|
||||
This example requires a running Kubernetes cluster. See the [Getting Started guides](../../../docs/getting-started-guides/) for how to get started.
|
||||
|
||||
Change to the `<kubernetes>` directory if you're not already there.
|
||||
|
||||
Step 1: Create a namespace
|
||||
-----------------------------------------
|
||||
This example will work in a custom namespace to demonstrate the concepts involved.
|
||||
|
||||
Let's create a new namespace called limit-example:
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl create -f docs/admin/limitrange/namespace.yaml
|
||||
namespace "limit-example" created
|
||||
$ kubectl get namespaces
|
||||
NAME LABELS STATUS AGE
|
||||
default <none> Active 5m
|
||||
limit-example <none> Active 53s
|
||||
{% endraw %}
|
||||
{% endhighlight %}
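
The `namespace.yaml` file referenced above is not reproduced here; a minimal manifest along these
lines creates the namespace:

{% highlight yaml %}
{% raw %}
apiVersion: v1
kind: Namespace
metadata:
  name: limit-example
{% endraw %}
{% endhighlight %}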
|
||||
|
||||
Step 2: Apply a limit to the namespace
|
||||
-----------------------------------------
|
||||
Let's create a simple limit in our namespace.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl create -f docs/admin/limitrange/limits.yaml --namespace=limit-example
|
||||
limitrange "mylimits" created
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
Let's describe the limits that we have imposed in our namespace.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl describe limits mylimits --namespace=limit-example
|
||||
Name: mylimits
|
||||
Namespace: limit-example
|
||||
Type Resource Min Max Request Limit Limit/Request
|
||||
---- -------- --- --- ------- ----- -------------
|
||||
Pod cpu 200m 2 - - -
|
||||
Pod memory 6Mi 1Gi - - -
|
||||
Container cpu 100m 2 200m 300m -
|
||||
Container memory 3Mi 1Gi 100Mi 200Mi -
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
In this scenario, we have said the following:
|
||||
|
||||
1. If a max constraint is specified for a resource (2 CPU and 1Gi memory in this case), then a limit
|
||||
must be specified for that resource across all containers. Failure to specify a limit will result in
|
||||
a validation error when attempting to create the pod. Note that a default value of limit is set by
|
||||
*default* in file `limits.yaml` (300m CPU and 200Mi memory).
|
||||
2. If a min constraint is specified for a resource (100m CPU and 3Mi memory in this case), then a
|
||||
request must be specified for that resource across all containers. Failure to specify a request will
|
||||
result in a validation error when attempting to create the pod. Note that a default value of request is
|
||||
set by *defaultRequest* in file `limits.yaml` (200m CPU and 100Mi memory).
|
||||
3. For any pod, the sum of all containers' memory requests must be >= 6Mi and the sum of all containers'
|
||||
memory limits must be <= 1Gi; the sum of all containers' CPU requests must be >= 200m and the sum of all
|
||||
containers' CPU limits must be <= 2.
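
For reference, a `LimitRange` consistent with the values above would look roughly like the sketch
below (reconstructed from the `kubectl describe` output, so it may differ cosmetically from the
`limits.yaml` used in this example):

{% highlight yaml %}
{% raw %}
apiVersion: v1
kind: LimitRange
metadata:
  name: mylimits
spec:
  limits:
  - type: Pod                 # constraints on each pod as a whole
    min:
      cpu: 200m
      memory: 6Mi
    max:
      cpu: "2"
      memory: 1Gi
  - type: Container           # constraints and defaults for each container
    min:
      cpu: 100m
      memory: 3Mi
    max:
      cpu: "2"
      memory: 1Gi
    default:                  # default limits applied when a container specifies none
      cpu: 300m
      memory: 200Mi
    defaultRequest:           # default requests applied when a container specifies none
      cpu: 200m
      memory: 100Mi
{% endraw %}
{% endhighlight %}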
|
||||
|
||||
Step 3: Enforcing limits at point of creation
|
||||
-----------------------------------------
|
||||
The limits enumerated in a namespace are only enforced when a pod is created or updated in
|
||||
the cluster. If you change the limits to a different value range, it does not affect pods that
|
||||
were previously created in a namespace.
|
||||
|
||||
If a resource (cpu or memory) is being restricted by a limit, the user will get an error at the time
|
||||
of creation explaining why.
|
||||
|
||||
Let's first spin up a replication controller that creates a single container pod to demonstrate
|
||||
how default values are applied to each pod.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl run nginx --image=nginx --replicas=1 --namespace=limit-example
|
||||
replicationcontroller "nginx" created
|
||||
$ kubectl get pods --namespace=limit-example
|
||||
NAME READY STATUS RESTARTS AGE
|
||||
nginx-aq0mf 1/1 Running 0 35s
|
||||
$ kubectl get pods nginx-aq0mf --namespace=limit-example -o yaml | grep resources -C 8
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
{% highlight yaml %}
|
||||
{% raw %}
|
||||
resourceVersion: "127"
|
||||
selfLink: /api/v1/namespaces/limit-example/pods/nginx-aq0mf
|
||||
uid: 51be42a7-7156-11e5-9921-286ed488f785
|
||||
spec:
|
||||
containers:
|
||||
- image: nginx
|
||||
imagePullPolicy: IfNotPresent
|
||||
name: nginx
|
||||
resources:
|
||||
limits:
|
||||
cpu: 300m
|
||||
memory: 200Mi
|
||||
requests:
|
||||
cpu: 200m
|
||||
memory: 100Mi
|
||||
terminationMessagePath: /dev/termination-log
|
||||
volumeMounts:
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
Note that our nginx container has picked up the namespace default cpu and memory resource *limits* and *requests*.
|
||||
|
||||
Let's create a pod that exceeds our allowed limits by giving it a container that requests 3 CPU cores.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl create -f docs/admin/limitrange/invalid-pod.yaml --namespace=limit-example
|
||||
Error from server: error when creating "docs/admin/limitrange/invalid-pod.yaml": Pod "invalid-pod" is forbidden: [Maximum cpu usage per Pod is 2, but limit is 3., Maximum cpu usage per Container is 2, but limit is 3.]
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
Let's create a pod that falls within the allowed limit boundaries.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl create -f docs/admin/limitrange/valid-pod.yaml --namespace=limit-example
|
||||
pod "valid-pod" created
|
||||
$ kubectl get pods valid-pod --namespace=limit-example -o yaml | grep -C 6 resources
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
{% highlight yaml %}
|
||||
{% raw %}
|
||||
uid: 162a12aa-7157-11e5-9921-286ed488f785
|
||||
spec:
|
||||
containers:
|
||||
- image: gcr.io/google_containers/serve_hostname
|
||||
imagePullPolicy: IfNotPresent
|
||||
name: kubernetes-serve-hostname
|
||||
resources:
|
||||
limits:
|
||||
cpu: "1"
|
||||
memory: 512Mi
|
||||
requests:
|
||||
cpu: "1"
|
||||
memory: 512Mi
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
Note that this pod specifies explicit resource *limits* and *requests* so it did not pick up the namespace
|
||||
default values.
|
||||
|
||||
Note: CPU resource *limits* are not enforced in the default Kubernetes setup on the physical node
that runs the container unless the administrator deploys the kubelet with the following flag:
|
||||
|
||||
```
|
||||
{% raw %}
|
||||
$ kubelet --help
|
||||
Usage of kubelet
|
||||
....
|
||||
--cpu-cfs-quota[=false]: Enable CPU CFS quota enforcement for containers that specify CPU limits
|
||||
$ kubelet --cpu-cfs-quota=true ...
|
||||
{% endraw %}
|
||||
```
|
||||
|
||||
Step 4: Cleanup
|
||||
----------------------------
|
||||
To remove the resources used by this example, you can just delete the limit-example namespace.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl delete namespace limit-example
|
||||
namespace "limit-example" deleted
|
||||
$ kubectl get namespaces
|
||||
NAME LABELS STATUS AGE
|
||||
default <none> Active 20m
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
Summary
|
||||
----------------------------
|
||||
Cluster operators who want to restrict the amount of resources a single container or pod may consume
can define allowable ranges per Kubernetes namespace. In the absence of any explicit assignments,
the Kubernetes system can apply default resource *limits* and *requests*, if desired, in order to
constrain the amount of resources a pod consumes on a node.
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/admin/limitrange/README.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,12 @@
|
|||
apiVersion: v1
|
||||
kind: Pod
|
||||
metadata:
|
||||
name: invalid-pod
|
||||
spec:
|
||||
containers:
|
||||
- name: kubernetes-serve-hostname
|
||||
image: gcr.io/google_containers/serve_hostname
|
||||
resources:
|
||||
limits:
|
||||
cpu: "3"
|
||||
memory: 100Mi
|
|
@ -0,0 +1,26 @@
|
|||
apiVersion: v1
|
||||
kind: LimitRange
|
||||
metadata:
|
||||
name: mylimits
|
||||
spec:
|
||||
limits:
|
||||
- max:
|
||||
cpu: "2"
|
||||
memory: 1Gi
|
||||
min:
|
||||
cpu: 200m
|
||||
memory: 6Mi
|
||||
type: Pod
|
||||
- default:
|
||||
cpu: 300m
|
||||
memory: 200Mi
|
||||
defaultRequest:
|
||||
cpu: 200m
|
||||
memory: 100Mi
|
||||
max:
|
||||
cpu: "2"
|
||||
memory: 1Gi
|
||||
min:
|
||||
cpu: 100m
|
||||
memory: 3Mi
|
||||
type: Container
|
|
@ -0,0 +1,4 @@
|
|||
apiVersion: v1
|
||||
kind: Namespace
|
||||
metadata:
|
||||
name: limit-example
|
|
@ -0,0 +1,14 @@
|
|||
apiVersion: v1
|
||||
kind: Pod
|
||||
metadata:
|
||||
name: valid-pod
|
||||
labels:
|
||||
name: valid-pod
|
||||
spec:
|
||||
containers:
|
||||
- name: kubernetes-serve-hostname
|
||||
image: gcr.io/google_containers/serve_hostname
|
||||
resources:
|
||||
limits:
|
||||
cpu: "1"
|
||||
memory: 512Mi
|
|
@ -0,0 +1,83 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "Considerations for running multiple Kubernetes clusters"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
# Considerations for running multiple Kubernetes clusters
|
||||
|
||||
You may want to set up multiple Kubernetes clusters, both to
|
||||
have clusters in different regions to be nearer to your users, and to tolerate failures and/or invasive maintenance.
|
||||
This document describes some of the issues to consider when making a decision about doing so.
|
||||
|
||||
Note that at present,
|
||||
Kubernetes does not offer a mechanism to aggregate multiple clusters into a single virtual cluster. However,
|
||||
we [plan to do this in the future](../proposals/federation.html).
|
||||
|
||||
## Scope of a single cluster
|
||||
|
||||
On IaaS providers such as Google Compute Engine or Amazon Web Services, a VM exists in a
|
||||
[zone](https://cloud.google.com/compute/docs/zones) or [availability
|
||||
zone](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html).
|
||||
We suggest that all the VMs in a Kubernetes cluster should be in the same availability zone, because:
|
||||
- compared to having a single global Kubernetes cluster, there are fewer single points of failure
|
||||
- compared to a cluster that spans availability zones, it is easier to reason about the availability properties of a
|
||||
single-zone cluster.
|
||||
- when the Kubernetes developers are designing the system (e.g. making assumptions about latency, bandwidth, or
|
||||
correlated failures) they are assuming all the machines are in a single data center, or otherwise closely connected.
|
||||
|
||||
It is okay to have multiple clusters per availability zone, though on balance we think fewer is better.
|
||||
Reasons to prefer fewer clusters are:
|
||||
- improved bin packing of Pods in some cases with more nodes in one cluster (less resource fragmentation)
|
||||
- reduced operational overhead (though the advantage is diminished as ops tooling and processes mature)
|
||||
- reduced per-cluster fixed resource costs, e.g. apiserver VMs (though these are small as a percentage
  of overall cluster cost for medium to large clusters).
|
||||
|
||||
Reasons to have multiple clusters include:
|
||||
- strict security policies requiring isolation of one class of work from another (but see Partitioning Clusters
|
||||
below).
|
||||
- test clusters to canary new Kubernetes releases or other cluster software.
|
||||
|
||||
## Selecting the right number of clusters
|
||||
|
||||
The selection of the number of Kubernetes clusters may be a relatively static choice, only revisited occasionally.
|
||||
By contrast, the number of nodes in a cluster and the number of pods in a service may change frequently according to
|
||||
load and growth.
|
||||
|
||||
To pick the number of clusters, first decide which regions you need to be in to have adequate latency to all your end users for services that will run
|
||||
on Kubernetes (if you use a Content Distribution Network, the latency requirements for the CDN-hosted content need not
|
||||
be considered). Legal issues might influence this as well. For example, a company with a global customer base might decide to have clusters in US, EU, AP, and SA regions.
|
||||
Call the number of regions to be in `R`.
|
||||
|
||||
Second, decide how many clusters should be able to be unavailable at the same time, while still being available. Call
|
||||
the number that can be unavailable `U`. If you are not sure, then 1 is a fine choice.
|
||||
|
||||
If it is allowable for load-balancing to direct traffic to any region in the event of a cluster failure, then
you need `R + U` clusters. If it is not (e.g. you want to ensure low latency for all users in the event of a
cluster failure), then you need to have `R * U` clusters (`U` in each of `R` regions). For example, with `R = 3`
and `U = 1`, the first case requires `3 + 1 = 4` clusters. In any case, try to put each cluster in a different zone.
|
||||
|
||||
Finally, if any of your clusters would need more than the maximum recommended number of nodes for a Kubernetes cluster, then
|
||||
you may need even more clusters. Kubernetes v1.0 currently supports clusters up to 100 nodes in size, but we are targeting
|
||||
1000-node clusters by early 2016.
|
||||
|
||||
## Working with multiple clusters
|
||||
|
||||
When you have multiple clusters, you would typically create services with the same config in each cluster and put each of those
|
||||
service instances behind a load balancer (AWS Elastic Load Balancer, GCE Forwarding Rule or HTTP Load Balancer) spanning all of them, so that
|
||||
failures of a single cluster are not visible to end users.
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/admin/multi-cluster.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,180 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "Namespaces"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
# Namespaces
|
||||
|
||||
## Abstract
|
||||
|
||||
A Namespace is a mechanism to partition resources created by users into
|
||||
a logically named group.
|
||||
|
||||
## Motivation
|
||||
|
||||
A single cluster should be able to satisfy the needs of multiple users or groups of users (henceforth a 'user community').
|
||||
|
||||
Each user community wants to be able to work in isolation from other communities.
|
||||
|
||||
Each user community has its own:
|
||||
|
||||
1. resources (pods, services, replication controllers, etc.)
|
||||
2. policies (who can or cannot perform actions in their community)
|
||||
3. constraints (this community is allowed this much quota, etc.)
|
||||
|
||||
A cluster operator may create a Namespace for each unique user community.
|
||||
|
||||
The Namespace provides a unique scope for:
|
||||
|
||||
1. named resources (to avoid basic naming collisions)
|
||||
2. delegated management authority to trusted users
|
||||
3. ability to limit community resource consumption
|
||||
|
||||
## Use cases
|
||||
|
||||
1. As a cluster operator, I want to support multiple user communities on a single cluster.
|
||||
2. As a cluster operator, I want to delegate authority to partitions of the cluster to trusted users
|
||||
in those communities.
|
||||
3. As a cluster operator, I want to limit the amount of resources each community can consume in order
|
||||
to limit the impact to other communities using the cluster.
|
||||
4. As a cluster user, I want to interact with resources that are pertinent to my user community in
|
||||
isolation of what other user communities are doing on the cluster.
|
||||
|
||||
|
||||
## Usage
|
||||
|
||||
Look [here](namespaces/) for an in-depth example of namespaces.
|
||||
|
||||
### Viewing namespaces
|
||||
|
||||
You can list the current namespaces in a cluster using:
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl get namespaces
|
||||
NAME LABELS STATUS
|
||||
default <none> Active
|
||||
kube-system <none> Active
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
Kubernetes starts with two initial namespaces:
|
||||
* `default` The default namespace for objects with no other namespace
|
||||
* `kube-system` The namespace for objects created by the Kubernetes system
|
||||
|
||||
You can also get the summary of a specific namespace using:
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl get namespaces <name>
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
Or you can get detailed information with:
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl describe namespaces <name>
|
||||
Name: default
|
||||
Labels: <none>
|
||||
Status: Active
|
||||
|
||||
No resource quota.
|
||||
|
||||
Resource Limits
|
||||
Type Resource Min Max Default
|
||||
---- -------- --- --- ---
|
||||
Container cpu - - 100m
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
Note that these details show both resource quota (if present) as well as resource limit ranges.
|
||||
|
||||
Resource quota tracks aggregate usage of resources in the *Namespace* and allows cluster operators
|
||||
to define *Hard* resource usage limits that a *Namespace* may consume.
|
||||
|
||||
A limit range defines min/max constraints on the amount of resources a single entity can consume in
|
||||
a *Namespace*.
|
||||
|
||||
See [Admission control: Limit Range](../design/admission_control_limit_range.html)
|
||||
|
||||
A namespace can be in one of two phases:
|
||||
* `Active` the namespace is in use
|
||||
* `Terminating` the namespace is being deleted, and cannot be used for new objects
|
||||
|
||||
See the [design doc](../design/namespaces.html#phases) for more details.
|
||||
|
||||
### Creating a new namespace
|
||||
|
||||
To create a new namespace, first create a new YAML file called `my-namespace.yaml` with the contents:
|
||||
|
||||
{% highlight yaml %}
|
||||
{% raw %}
|
||||
apiVersion: v1
|
||||
kind: Namespace
|
||||
metadata:
|
||||
name: <insert-namespace-name-here>
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
Note that the name of your namespace must be a DNS-compatible label.
|
||||
|
||||
More information on the `finalizers` field can be found in the namespace [design doc](../design/namespaces.html#finalizers).
|
||||
|
||||
Then run:
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl create -f ./my-namespace.yaml
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
### Working in namespaces
|
||||
|
||||
See [Setting the namespace for a request](../../docs/user-guide/namespaces.html#setting-the-namespace-for-a-request)
|
||||
and [Setting the namespace preference](../../docs/user-guide/namespaces.html#setting-the-namespace-preference).
|
||||
|
||||
### Deleting a namespace
|
||||
|
||||
You can delete a namespace with
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl delete namespaces <insert-some-namespace-name>
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
**WARNING, this deletes _everything_ under the namespace!**
|
||||
|
||||
This delete is asynchronous, so for a time you will see the namespace in the `Terminating` state.
|
||||
|
||||
## Namespaces and DNS
|
||||
|
||||
When you create a [Service](../../docs/user-guide/services.html), it creates a corresponding [DNS entry](dns.html).
|
||||
This entry is of the form `<service-name>.<namespace-name>.svc.cluster.local`, which means
|
||||
that if a container just uses `<service-name>` it will resolve to the service which
|
||||
is local to a namespace. This is useful for using the same configuration across
|
||||
multiple namespaces such as Development, Staging and Production. If you want to reach
|
||||
across namespaces, you need to use the fully qualified domain name (FQDN).
|
||||
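For example, assuming a hypothetical service named `my-service` exists in both the `development` and `production` namespaces, lookups made from inside a pod in `development` would behave like the following sketch:

{% highlight console %}
{% raw %}
# Run from inside a pod in the "development" namespace (service name is hypothetical):
$ nslookup my-service
# resolves to my-service.development.svc.cluster.local (the local namespace)
$ nslookup my-service.production.svc.cluster.local
# the FQDN reaches the my-service service in the production namespace
{% endraw %}
{% endhighlight %}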
|
||||
## Design
|
||||
|
||||
Details of the design of namespaces in Kubernetes, including a [detailed example](../design/namespaces.html#example-openshift-origin-managing-a-kubernetes-namespace),
can be found in the [namespaces design doc](../design/namespaces.html).
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/admin/namespaces.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,302 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "Kubernetes Namespaces"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
## Kubernetes Namespaces
|
||||
|
||||
Kubernetes _[namespaces](../../../docs/admin/namespaces.html)_ help different projects, teams, or customers to share a Kubernetes cluster.
|
||||
|
||||
They do this by providing the following:
|
||||
|
||||
1. A scope for [Names](../../user-guide/identifiers.html).
|
||||
2. A mechanism to attach authorization and policy to a subsection of the cluster.
|
||||
|
||||
Use of multiple namespaces is optional.
|
||||
|
||||
This example demonstrates how to use Kubernetes namespaces to subdivide your cluster.
|
||||
|
||||
### Step Zero: Prerequisites
|
||||
|
||||
This example assumes the following:
|
||||
|
||||
1. You have an [existing Kubernetes cluster](../../getting-started-guides/).
|
||||
2. You have a basic understanding of Kubernetes _[pods](../../user-guide/pods.html)_, _[services](../../user-guide/services.html)_, and _[replication controllers](../../user-guide/replication-controller.html)_.
|
||||
|
||||
### Step One: Understand the default namespace
|
||||
|
||||
By default, a Kubernetes cluster will instantiate a default namespace when provisioning the cluster to hold the default set of pods,
|
||||
services, and replication controllers used by the cluster.
|
||||
|
||||
Assuming you have a fresh cluster, you can introspect the available namespaces by doing the following:
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl get namespaces
|
||||
NAME LABELS
|
||||
default <none>
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
### Step Two: Create new namespaces
|
||||
|
||||
For this exercise, we will create two additional Kubernetes namespaces to hold our content.
|
||||
|
||||
Let's imagine a scenario where an organization is using a shared Kubernetes cluster for development and production use cases.
|
||||
|
||||
The development team would like to maintain a space in the cluster where they can get a view on the list of pods, services, and replication controllers
|
||||
they use to build and run their application. In this space, Kubernetes resources come and go, and the restrictions on who can or cannot modify resources
|
||||
are relaxed to enable agile development.
|
||||
|
||||
The operations team would like to maintain a space in the cluster where they can enforce strict procedures on who can or cannot manipulate the set of
|
||||
pods, services, and replication controllers that run the production site.
|
||||
|
||||
One pattern this organization could follow is to partition the Kubernetes cluster into two namespaces: development and production.
|
||||
|
||||
Let's create two new namespaces to hold our work.
|
||||
|
||||
Use the file [`namespace-dev.json`](namespace-dev.json) which describes a development namespace:
|
||||
|
||||
<!-- BEGIN MUNGE: EXAMPLE namespace-dev.json -->
|
||||
|
||||
{% highlight json %}
|
||||
{% raw %}
|
||||
{
|
||||
"kind": "Namespace",
|
||||
"apiVersion": "v1",
|
||||
"metadata": {
|
||||
"name": "development",
|
||||
"labels": {
|
||||
"name": "development"
|
||||
}
|
||||
}
|
||||
}
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
[Download example](namespace-dev.json)
|
||||
<!-- END MUNGE: EXAMPLE namespace-dev.json -->
|
||||
|
||||
Create the development namespace using kubectl.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl create -f docs/admin/namespaces/namespace-dev.json
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
Then let's create the production namespace using kubectl.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl create -f docs/admin/namespaces/namespace-prod.json
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
To be sure things are right, let's list all of the namespaces in our cluster.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl get namespaces
|
||||
NAME LABELS STATUS
|
||||
default <none> Active
|
||||
development name=development Active
|
||||
production name=production Active
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
|
||||
### Step Three: Create pods in each namespace
|
||||
|
||||
A Kubernetes namespace provides the scope for pods, services, and replication controllers in the cluster.
|
||||
|
||||
Users interacting with one namespace do not see the content in another namespace.
|
||||
|
||||
To demonstrate this, let's spin up a simple replication controller and pod in the development namespace.
|
||||
|
||||
We first check the current context:
|
||||
|
||||
{% highlight yaml %}
|
||||
{% raw %}
|
||||
apiVersion: v1
|
||||
clusters:
|
||||
- cluster:
|
||||
certificate-authority-data: REDACTED
|
||||
server: https://130.211.122.180
|
||||
name: lithe-cocoa-92103_kubernetes
|
||||
contexts:
|
||||
- context:
|
||||
cluster: lithe-cocoa-92103_kubernetes
|
||||
user: lithe-cocoa-92103_kubernetes
|
||||
name: lithe-cocoa-92103_kubernetes
|
||||
current-context: lithe-cocoa-92103_kubernetes
|
||||
kind: Config
|
||||
preferences: {}
|
||||
users:
|
||||
- name: lithe-cocoa-92103_kubernetes
|
||||
user:
|
||||
client-certificate-data: REDACTED
|
||||
client-key-data: REDACTED
|
||||
token: 65rZW78y8HbwXXtSXuUw9DbP4FLjHi4b
|
||||
- name: lithe-cocoa-92103_kubernetes-basic-auth
|
||||
user:
|
||||
password: h5M0FtUUIflBSdI7
|
||||
username: admin
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
The next step is to define a context for the kubectl client to work in each namespace. The values of the "cluster" and "user" fields are copied from the current context.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl config set-context dev --namespace=development --cluster=lithe-cocoa-92103_kubernetes --user=lithe-cocoa-92103_kubernetes
|
||||
$ kubectl config set-context prod --namespace=production --cluster=lithe-cocoa-92103_kubernetes --user=lithe-cocoa-92103_kubernetes
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
The above commands created two request contexts that you can switch between, depending on which namespace you
want to work in.
|
||||
|
||||
Let's switch to operate in the development namespace.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl config use-context dev
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
You can verify your current context by doing the following:
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl config view
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
{% highlight yaml %}
|
||||
{% raw %}
|
||||
apiVersion: v1
|
||||
clusters:
|
||||
- cluster:
|
||||
certificate-authority-data: REDACTED
|
||||
server: https://130.211.122.180
|
||||
name: lithe-cocoa-92103_kubernetes
|
||||
contexts:
|
||||
- context:
|
||||
cluster: lithe-cocoa-92103_kubernetes
|
||||
namespace: development
|
||||
user: lithe-cocoa-92103_kubernetes
|
||||
name: dev
|
||||
- context:
|
||||
cluster: lithe-cocoa-92103_kubernetes
|
||||
user: lithe-cocoa-92103_kubernetes
|
||||
name: lithe-cocoa-92103_kubernetes
|
||||
- context:
|
||||
cluster: lithe-cocoa-92103_kubernetes
|
||||
namespace: production
|
||||
user: lithe-cocoa-92103_kubernetes
|
||||
name: prod
|
||||
current-context: dev
|
||||
kind: Config
|
||||
preferences: {}
|
||||
users:
|
||||
- name: lithe-cocoa-92103_kubernetes
|
||||
user:
|
||||
client-certificate-data: REDACTED
|
||||
client-key-data: REDACTED
|
||||
token: 65rZW78y8HbwXXtSXuUw9DbP4FLjHi4b
|
||||
- name: lithe-cocoa-92103_kubernetes-basic-auth
|
||||
user:
|
||||
password: h5M0FtUUIflBSdI7
|
||||
username: admin
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
At this point, all requests we make to the Kubernetes cluster from the command line are scoped to the development namespace.
|
||||
|
||||
Let's create some content.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl run snowflake --image=kubernetes/serve_hostname --replicas=2
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
We have just created a replication controller with a replica count of 2 that runs pods named snowflake, each with a basic container that simply serves the hostname.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl get rc
|
||||
CONTROLLER CONTAINER(S) IMAGE(S) SELECTOR REPLICAS
|
||||
snowflake snowflake kubernetes/serve_hostname run=snowflake 2
|
||||
|
||||
$ kubectl get pods
|
||||
NAME READY STATUS RESTARTS AGE
|
||||
snowflake-8w0qn 1/1 Running 0 22s
|
||||
snowflake-jrpzb 1/1 Running 0 22s
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
This is great: developers are able to do what they want without having to worry about affecting content in the production namespace.
|
||||
|
||||
Let's switch to the production namespace and show how resources in one namespace are hidden from the other.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl config use-context prod
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
The production namespace should be empty.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl get rc
|
||||
CONTROLLER CONTAINER(S) IMAGE(S) SELECTOR REPLICAS
|
||||
|
||||
$ kubectl get pods
|
||||
NAME READY STATUS RESTARTS AGE
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
Production likes to run cattle, so let's create some cattle pods.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl run cattle --image=kubernetes/serve_hostname --replicas=5
|
||||
|
||||
$ kubectl get rc
|
||||
CONTROLLER CONTAINER(S) IMAGE(S) SELECTOR REPLICAS
|
||||
cattle cattle kubernetes/serve_hostname run=cattle 5
|
||||
|
||||
$ kubectl get pods
|
||||
NAME READY STATUS RESTARTS AGE
|
||||
cattle-97rva 1/1 Running 0 12s
|
||||
cattle-i9ojn 1/1 Running 0 12s
|
||||
cattle-qj3yv 1/1 Running 0 12s
|
||||
cattle-yc7vn 1/1 Running 0 12s
|
||||
cattle-zz7ea 1/1 Running 0 12s
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
At this point, it should be clear that the resources users create in one namespace are hidden from the other namespace.
|
||||
|
||||
As the policy support in Kubernetes evolves, we will extend this scenario to show how you can provide different
|
||||
authorization rules for each namespace.
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/admin/namespaces/README.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,302 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "Kubernetes Namespaces"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
## Kubernetes Namespaces
|
||||
|
||||
Kubernetes _[namespaces](../../../docs/admin/namespaces.html)_ help different projects, teams, or customers to share a Kubernetes cluster.
|
||||
|
||||
They do this by providing the following:
|
||||
|
||||
1. A scope for [Names](../../user-guide/identifiers.html).
|
||||
2. A mechanism to attach authorization and policy to a subsection of the cluster.
|
||||
|
||||
Use of multiple namespaces is optional.
|
||||
|
||||
This example demonstrates how to use Kubernetes namespaces to subdivide your cluster.
|
||||
|
||||
### Step Zero: Prerequisites
|
||||
|
||||
This example assumes the following:
|
||||
|
||||
1. You have an [existing Kubernetes cluster](../../getting-started-guides/).
|
||||
2. You have a basic understanding of Kubernetes _[pods](../../user-guide/pods.html)_, _[services](../../user-guide/services.html)_, and _[replication controllers](../../user-guide/replication-controller.html)_.
|
||||
|
||||
### Step One: Understand the default namespace
|
||||
|
||||
By default, a Kubernetes cluster will instantiate a default namespace when provisioning the cluster to hold the default set of pods,
|
||||
services, and replication controllers used by the cluster.
|
||||
|
||||
Assuming you have a fresh cluster, you can introspect the available namespaces by doing the following:
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl get namespaces
|
||||
NAME LABELS
|
||||
default <none>
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
### Step Two: Create new namespaces
|
||||
|
||||
For this exercise, we will create two additional Kubernetes namespaces to hold our content.
|
||||
|
||||
Let's imagine a scenario where an organization is using a shared Kubernetes cluster for development and production use cases.
|
||||
|
||||
The development team would like to maintain a space in the cluster where they can get a view on the list of pods, services, and replication controllers
|
||||
they use to build and run their application. In this space, Kubernetes resources come and go, and the restrictions on who can or cannot modify resources
|
||||
are relaxed to enable agile development.
|
||||
|
||||
The operations team would like to maintain a space in the cluster where they can enforce strict procedures on who can or cannot manipulate the set of
|
||||
pods, services, and replication controllers that run the production site.
|
||||
|
||||
One pattern this organization could follow is to partition the Kubernetes cluster into two namespaces: development and production.
|
||||
|
||||
Let's create two new namespaces to hold our work.
|
||||
|
||||
Use the file [`namespace-dev.json`](namespace-dev.json) which describes a development namespace:
|
||||
|
||||
<!-- BEGIN MUNGE: EXAMPLE namespace-dev.json -->
|
||||
|
||||
{% highlight json %}
|
||||
{% raw %}
|
||||
{
|
||||
"kind": "Namespace",
|
||||
"apiVersion": "v1",
|
||||
"metadata": {
|
||||
"name": "development",
|
||||
"labels": {
|
||||
"name": "development"
|
||||
}
|
||||
}
|
||||
}
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
[Download example](namespace-dev.json)
|
||||
<!-- END MUNGE: EXAMPLE namespace-dev.json -->
|
||||
|
||||
Create the development namespace using kubectl.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl create -f docs/admin/namespaces/namespace-dev.json
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
Then let's create the production namespace using kubectl.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl create -f docs/admin/namespaces/namespace-prod.json
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
To be sure things are right, let's list all of the namespaces in our cluster.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl get namespaces
|
||||
NAME LABELS STATUS
|
||||
default <none> Active
|
||||
development name=development Active
|
||||
production name=production Active
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
|
||||
### Step Three: Create pods in each namespace
|
||||
|
||||
A Kubernetes namespace provides the scope for pods, services, and replication controllers in the cluster.
|
||||
|
||||
Users interacting with one namespace do not see the content in another namespace.
|
||||
|
||||
To demonstrate this, let's spin up a simple replication controller and pod in the development namespace.
|
||||
|
||||
We first check the current context:
|
||||
|
||||
{% highlight yaml %}
|
||||
{% raw %}
|
||||
apiVersion: v1
|
||||
clusters:
|
||||
- cluster:
|
||||
certificate-authority-data: REDACTED
|
||||
server: https://130.211.122.180
|
||||
name: lithe-cocoa-92103_kubernetes
|
||||
contexts:
|
||||
- context:
|
||||
cluster: lithe-cocoa-92103_kubernetes
|
||||
user: lithe-cocoa-92103_kubernetes
|
||||
name: lithe-cocoa-92103_kubernetes
|
||||
current-context: lithe-cocoa-92103_kubernetes
|
||||
kind: Config
|
||||
preferences: {}
|
||||
users:
|
||||
- name: lithe-cocoa-92103_kubernetes
|
||||
user:
|
||||
client-certificate-data: REDACTED
|
||||
client-key-data: REDACTED
|
||||
token: 65rZW78y8HbwXXtSXuUw9DbP4FLjHi4b
|
||||
- name: lithe-cocoa-92103_kubernetes-basic-auth
|
||||
user:
|
||||
password: h5M0FtUUIflBSdI7
|
||||
username: admin
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
The next step is to define a context for the kubectl client to work in each namespace. The values of the "cluster" and "user" fields are copied from the current context.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl config set-context dev --namespace=development --cluster=lithe-cocoa-92103_kubernetes --user=lithe-cocoa-92103_kubernetes
|
||||
$ kubectl config set-context prod --namespace=production --cluster=lithe-cocoa-92103_kubernetes --user=lithe-cocoa-92103_kubernetes
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
The above commands created two request contexts that you can switch between, depending on which namespace you
want to work in.
|
||||
|
||||
Let's switch to operate in the development namespace.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl config use-context dev
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
You can verify your current context by doing the following:
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl config view
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
{% highlight yaml %}
|
||||
{% raw %}
|
||||
apiVersion: v1
|
||||
clusters:
|
||||
- cluster:
|
||||
certificate-authority-data: REDACTED
|
||||
server: https://130.211.122.180
|
||||
name: lithe-cocoa-92103_kubernetes
|
||||
contexts:
|
||||
- context:
|
||||
cluster: lithe-cocoa-92103_kubernetes
|
||||
namespace: development
|
||||
user: lithe-cocoa-92103_kubernetes
|
||||
name: dev
|
||||
- context:
|
||||
cluster: lithe-cocoa-92103_kubernetes
|
||||
user: lithe-cocoa-92103_kubernetes
|
||||
name: lithe-cocoa-92103_kubernetes
|
||||
- context:
|
||||
cluster: lithe-cocoa-92103_kubernetes
|
||||
namespace: production
|
||||
user: lithe-cocoa-92103_kubernetes
|
||||
name: prod
|
||||
current-context: dev
|
||||
kind: Config
|
||||
preferences: {}
|
||||
users:
|
||||
- name: lithe-cocoa-92103_kubernetes
|
||||
user:
|
||||
client-certificate-data: REDACTED
|
||||
client-key-data: REDACTED
|
||||
token: 65rZW78y8HbwXXtSXuUw9DbP4FLjHi4b
|
||||
- name: lithe-cocoa-92103_kubernetes-basic-auth
|
||||
user:
|
||||
password: h5M0FtUUIflBSdI7
|
||||
username: admin
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
At this point, all requests we make to the Kubernetes cluster from the command line are scoped to the development namespace.
|
||||
|
||||
Let's create some content.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl run snowflake --image=kubernetes/serve_hostname --replicas=2
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
We have just created a replication controller with a replica count of 2 that runs pods named snowflake, each with a basic container that simply serves the hostname.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl get rc
|
||||
CONTROLLER CONTAINER(S) IMAGE(S) SELECTOR REPLICAS
|
||||
snowflake snowflake kubernetes/serve_hostname run=snowflake 2
|
||||
|
||||
$ kubectl get pods
|
||||
NAME READY STATUS RESTARTS AGE
|
||||
snowflake-8w0qn 1/1 Running 0 22s
|
||||
snowflake-jrpzb 1/1 Running 0 22s
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
This is great: developers are able to do what they want without having to worry about affecting content in the production namespace.
|
||||
|
||||
Let's switch to the production namespace and show how resources in one namespace are hidden from the other.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl config use-context prod
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
The production namespace should be empty.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl get rc
|
||||
CONTROLLER CONTAINER(S) IMAGE(S) SELECTOR REPLICAS
|
||||
|
||||
$ kubectl get pods
|
||||
NAME READY STATUS RESTARTS AGE
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
Production likes to run cattle, so let's create some cattle pods.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl run cattle --image=kubernetes/serve_hostname --replicas=5
|
||||
|
||||
$ kubectl get rc
|
||||
CONTROLLER CONTAINER(S) IMAGE(S) SELECTOR REPLICAS
|
||||
cattle cattle kubernetes/serve_hostname run=cattle 5
|
||||
|
||||
$ kubectl get pods
|
||||
NAME READY STATUS RESTARTS AGE
|
||||
cattle-97rva 1/1 Running 0 12s
|
||||
cattle-i9ojn 1/1 Running 0 12s
|
||||
cattle-qj3yv 1/1 Running 0 12s
|
||||
cattle-yc7vn 1/1 Running 0 12s
|
||||
cattle-zz7ea 1/1 Running 0 12s
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
At this point, it should be clear that the resources users create in one namespace are hidden from the other namespace.
|
||||
|
||||
As the policy support in Kubernetes evolves, we will extend this scenario to show how you can provide different
|
||||
authorization rules for each namespace.
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/admin/namespaces/README.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,10 @@
|
|||
{
|
||||
"kind": "Namespace",
|
||||
"apiVersion": "v1",
|
||||
"metadata": {
|
||||
"name": "development",
|
||||
"labels": {
|
||||
"name": "development"
|
||||
}
|
||||
}
|
||||
}
|
|
@ -0,0 +1,10 @@
|
|||
{
|
||||
"kind": "Namespace",
|
||||
"apiVersion": "v1",
|
||||
"metadata": {
|
||||
"name": "production",
|
||||
"labels": {
|
||||
"name": "production"
|
||||
}
|
||||
}
|
||||
}
|
|
@ -0,0 +1,223 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "Networking in Kubernetes"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
# Networking in Kubernetes
|
||||
|
||||
**Table of Contents**
|
||||
<!-- BEGIN MUNGE: GENERATED_TOC -->
|
||||
|
||||
- [Networking in Kubernetes](#networking-in-kubernetes)
|
||||
- [Summary](#summary)
|
||||
- [Docker model](#docker-model)
|
||||
- [Kubernetes model](#kubernetes-model)
|
||||
- [How to achieve this](#how-to-achieve-this)
|
||||
- [Google Compute Engine (GCE)](#google-compute-engine-gce)
|
||||
- [L2 networks and linux bridging](#l2-networks-and-linux-bridging)
|
||||
- [Flannel](#flannel)
|
||||
- [OpenVSwitch](#openvswitch)
|
||||
- [Weave](#weave)
|
||||
- [Calico](#calico)
|
||||
- [Other reading](#other-reading)
|
||||
|
||||
<!-- END MUNGE: GENERATED_TOC -->
|
||||
|
||||
Kubernetes approaches networking somewhat differently than Docker does by
|
||||
default. There are 4 distinct networking problems to solve:
|
||||
1. Highly-coupled container-to-container communications: this is solved by
|
||||
[pods](../user-guide/pods.html) and `localhost` communications.
|
||||
2. Pod-to-Pod communications: this is the primary focus of this document.
|
||||
3. Pod-to-Service communications: this is covered by [services](../user-guide/services.html).
|
||||
4. External-to-Service communications: this is covered by [services](../user-guide/services.html).
|
||||
|
||||
## Summary
|
||||
|
||||
Kubernetes assumes that pods can communicate with other pods, regardless of
|
||||
which host they land on. We give every pod its own IP address so you do not
|
||||
need to explicitly create links between pods and you almost never need to deal
|
||||
with mapping container ports to host ports. This creates a clean,
|
||||
backwards-compatible model where pods can be treated much like VMs or physical
|
||||
hosts from the perspectives of port allocation, naming, service discovery, load
|
||||
balancing, application configuration, and migration.
|
||||
|
||||
To achieve this we must impose some requirements on how you set up your cluster
|
||||
networking.
|
||||
|
||||
## Docker model
|
||||
|
||||
Before discussing the Kubernetes approach to networking, it is worthwhile to
|
||||
review the "normal" way that networking works with Docker. By default, Docker
|
||||
uses host-private networking. It creates a virtual bridge, called `docker0` by
|
||||
default, and allocates a subnet from one of the private address blocks defined
|
||||
in [RFC1918](https://tools.ietf.org/html/rfc1918) for that bridge. For each
|
||||
container that Docker creates, it allocates a virtual ethernet device (called
|
||||
`veth`) which is attached to the bridge. The veth is mapped to appear as `eth0`
|
||||
in the container, using Linux namespaces. The in-container `eth0` interface is
|
||||
given an IP address from the bridge's address range.
|
||||
|
||||
The result is that Docker containers can talk to other containers only if they
|
||||
are on the same machine (and thus the same virtual bridge). Containers on
|
||||
different machines cannot reach each other - in fact, they may end up with the
exact same network ranges and IP addresses.
|
||||
|
||||
In order for Docker containers to communicate across nodes, they must be
|
||||
allocated ports on the machine's own IP address, which are then forwarded or
|
||||
proxied to the containers. This obviously means that containers must either
|
||||
coordinate which ports they use very carefully or else be allocated ports
|
||||
dynamically.
|
||||
|
||||
## Kubernetes model
|
||||
|
||||
Coordinating ports across multiple developers is very difficult to do at
|
||||
scale and exposes users to cluster-level issues outside of their control.
|
||||
Dynamic port allocation brings a lot of complications to the system - every
|
||||
application has to take ports as flags, the API servers have to know how to
|
||||
insert dynamic port numbers into configuration blocks, services have to know
|
||||
how to find each other, etc. Rather than deal with this, Kubernetes takes a
|
||||
different approach.
|
||||
|
||||
Kubernetes imposes the following fundamental requirements on any networking
|
||||
implementation (barring any intentional network segmentation policies):
|
||||
* all containers can communicate with all other containers without NAT
|
||||
* all nodes can communicate with all containers (and vice-versa) without NAT
|
||||
* the IP that a container sees itself as is the same IP that others see it as
|
||||
|
||||
What this means in practice is that you cannot just take two computers
|
||||
running Docker and expect Kubernetes to work. You must ensure that the
|
||||
fundamental requirements are met.
|
||||
|
||||
This model is not only less complex overall, but it is principally compatible
|
||||
with the desire for Kubernetes to enable low-friction porting of apps from VMs
|
||||
to containers. If your job previously ran in a VM, your VM had an IP and could
|
||||
talk to other VMs in your project. This is the same basic model.
|
||||
|
||||
Until now this document has talked about containers. In reality, Kubernetes
|
||||
applies IP addresses at the `Pod` scope - containers within a `Pod` share their
|
||||
network namespaces - including their IP address. This means that containers
|
||||
within a `Pod` can all reach each other’s ports on `localhost`. This does imply
|
||||
that containers within a `Pod` must coordinate port usage, but this is no
|
||||
different than processes in a VM. We call this the "IP-per-pod" model. This
|
||||
is implemented in Docker as a "pod container" which holds the network namespace
|
||||
open while "app containers" (the things the user specified) join that namespace
|
||||
with Docker's `--net=container:<id>` function.
|
||||
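As an illustration of that pattern with plain Docker commands (a sketch only; the image names are just examples), the first container below holds the network namespace open and the second joins it, so both share one IP and can reach each other over `localhost`:

{% highlight sh %}
{% raw %}
# Start a placeholder "pod container" that only holds the network namespace open.
docker run -d --name pod-infra gcr.io/google_containers/pause

# Join an "app container" to that namespace; it shares the pod container's IP.
docker run -d --name app --net=container:pod-infra nginx
{% endraw %}
{% endhighlight %}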
|
||||
As with Docker, it is possible to request host ports, but this is reduced to a
|
||||
very niche operation. In this case a port will be allocated on the host `Node`
|
||||
and traffic will be forwarded to the `Pod`. The `Pod` itself is blind to the
|
||||
existence or non-existence of host ports.
|
||||
|
||||
## How to achieve this
|
||||
|
||||
There are a number of ways that this network model can be implemented. This
|
||||
document is not an exhaustive study of the various methods, but hopefully serves
|
||||
as an introduction to various technologies and serves as a jumping-off point.
|
||||
If some techniques become vastly preferable to others, we might detail them more
|
||||
here.
|
||||
|
||||
### Google Compute Engine (GCE)
|
||||
|
||||
For the Google Compute Engine cluster configuration scripts, we use [advanced
|
||||
routing](https://developers.google.com/compute/docs/networking#routing) to
|
||||
assign each VM a subnet (default is `/24` - 254 IPs). Any traffic bound for that
|
||||
subnet will be routed directly to the VM by the GCE network fabric. This is in
|
||||
addition to the "main" IP address assigned to the VM, which is NAT'ed for
|
||||
outbound internet access. A linux bridge (called `cbr0`) is configured to exist
|
||||
on that subnet, and is passed to docker's `--bridge` flag.
|
||||
|
||||
We start Docker with:
|
||||
|
||||
{% highlight sh %}
|
||||
{% raw %}
|
||||
DOCKER_OPTS="--bridge=cbr0 --iptables=false --ip-masq=false"
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
This bridge is created by Kubelet (controlled by the `--configure-cbr0=true`
|
||||
flag) according to the `Node`'s `spec.podCIDR`.
|
||||
|
||||
Docker will now allocate IPs from the `cbr-cidr` block. Containers can reach
|
||||
each other and `Nodes` over the `cbr0` bridge. Those IPs are all routable
|
||||
within the GCE project network.
|
||||
|
||||
GCE itself does not know anything about these IPs, though, so it will not NAT
|
||||
them for outbound internet traffic. To achieve that we use an iptables rule to
|
||||
masquerade (aka SNAT - to make it seem as if packets came from the `Node`
|
||||
itself) traffic that is bound for IPs outside the GCE project network
|
||||
(10.0.0.0/8).
|
||||
|
||||
{% highlight sh %}
|
||||
{% raw %}
|
||||
iptables -t nat -A POSTROUTING ! -d 10.0.0.0/8 -o eth0 -j MASQUERADE
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
Lastly we enable IP forwarding in the kernel (so the kernel will process
|
||||
packets for bridged containers):
|
||||
|
||||
{% highlight sh %}
|
||||
{% raw %}
|
||||
sysctl net.ipv4.ip_forward=1
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
The result of all this is that all `Pods` can reach each other and can egress
|
||||
traffic to the internet.
|
||||
|
||||
### L2 networks and linux bridging
|
||||
|
||||
If you have a "dumb" L2 network, such as a simple switch in a "bare-metal"
|
||||
environment, you should be able to do something similar to the above GCE setup.
|
||||
Note that these instructions have only been tried very casually - it seems to
|
||||
work, but has not been thoroughly tested. If you use this technique and
|
||||
perfect the process, please let us know.
|
||||
|
||||
Follow the "With Linux Bridge devices" section of [this very nice
|
||||
tutorial](http://blog.oddbit.com/2014/08/11/four-ways-to-connect-a-docker/) from
|
||||
Lars Kellogg-Stedman.
|
||||
|
||||
### Flannel
|
||||
|
||||
[Flannel](https://github.com/coreos/flannel#flannel) is a very simple overlay
|
||||
network that satisfies the Kubernetes requirements. It installs in minutes and
|
||||
should get you up and running if the above techniques are not working. Many
|
||||
people have reported success with Flannel and Kubernetes.
|
||||
|
||||
### OpenVSwitch
|
||||
|
||||
[OpenVSwitch](ovs-networking.html) is a somewhat more mature but also
|
||||
complicated way to build an overlay network. This is endorsed by several of the
|
||||
"Big Shops" for networking.
|
||||
|
||||
### Weave
|
||||
|
||||
[Weave](https://github.com/zettio/weave) is yet another way to build an overlay
|
||||
network, primarily aiming at Docker integration.
|
||||
|
||||
### Calico
|
||||
|
||||
[Calico](https://github.com/Metaswitch/calico) uses BGP to enable real container
|
||||
IPs.
|
||||
|
||||
## Other reading
|
||||
|
||||
The early design of the networking model and its rationale, and some future
|
||||
plans are described in more detail in the [networking design
|
||||
document](../design/networking.html).
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/admin/networking.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,257 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "Node"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
# Node
|
||||
|
||||
**Table of Contents**
|
||||
<!-- BEGIN MUNGE: GENERATED_TOC -->
|
||||
|
||||
- [Node](#node)
|
||||
- [What is a node?](#what-is-a-node)
|
||||
- [Node Status](#node-status)
|
||||
- [Node Addresses](#node-addresses)
|
||||
- [Node Phase](#node-phase)
|
||||
- [Node Condition](#node-condition)
|
||||
- [Node Capacity](#node-capacity)
|
||||
- [Node Info](#node-info)
|
||||
- [Node Management](#node-management)
|
||||
- [Node Controller](#node-controller)
|
||||
- [Self-Registration of Nodes](#self-registration-of-nodes)
|
||||
- [Manual Node Administration](#manual-node-administration)
|
||||
- [Node capacity](#node-capacity)
|
||||
- [API Object](#api-object)
|
||||
|
||||
<!-- END MUNGE: GENERATED_TOC -->
|
||||
|
||||
## What is a node?
|
||||
|
||||
`Node` is a worker machine in Kubernetes, previously known as `Minion`. A node
may be a VM or a physical machine, depending on the cluster. Each node has
|
||||
the services necessary to run [Pods](../user-guide/pods.html) and is managed by the master
|
||||
components. The services on a node include docker, kubelet and network proxy. See
|
||||
[The Kubernetes Node](../design/architecture.html#the-kubernetes-node) section in the
|
||||
architecture design doc for more details.
|
||||
|
||||
## Node Status
|
||||
|
||||
Node status describes the current status of a node. It currently contains the following
pieces of information:
|
||||
|
||||
### Node Addresses
|
||||
|
||||
The usage of these fields varies depending on your cloud provider or bare-metal configuration; a sketch of how they can appear in a node's status follows this list.
|
||||
|
||||
* HostName: Generally not used
|
||||
|
||||
* ExternalIP: Generally the IP address of the node that is externally routable (available from outside the cluster)
|
||||
|
||||
* InternalIP: Generally the IP address of the node that is routable only within the cluster
|
||||
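For illustration, here is a minimal sketch (with made-up addresses and hostname) of how these entries can appear under a node's `status.addresses`; the exact set of entries depends on the provider.

{% highlight yaml %}
{% raw %}
# Illustrative values only.
status:
  addresses:
  - type: ExternalIP
    address: 104.155.10.20
  - type: InternalIP
    address: 10.240.79.157
  - type: Hostname
    address: kubernetes-node-1
{% endraw %}
{% endhighlight %}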
|
||||
|
||||
### Node Phase
|
||||
|
||||
Node Phase is the current lifecycle phase of a node, one of `Pending`,
|
||||
`Running` and `Terminated`.
|
||||
|
||||
* Pending: New nodes are created in this state. A node stays in this state until it is configured.
|
||||
|
||||
* Running: Node has been configured and the Kubernetes components are running
|
||||
|
||||
* Terminated: Node has been removed from the cluster. It will not receive any scheduling requests,
|
||||
and any running pods will be removed from the node.
|
||||
|
||||
A node in the `Running` phase is necessary but not sufficient for
scheduling Pods. For a node to be considered a scheduling candidate, it
must also have the appropriate conditions; see below.
|
||||
|
||||
### Node Condition
|
||||
|
||||
Node Condition describes the conditions of `Running` nodes. Currently the only
|
||||
node condition is Ready. The Status of this condition can be True, False, or
|
||||
Unknown. True means the Kubelet is healthy and ready to accept pods.
|
||||
False means the Kubelet is not healthy and is not accepting pods. Unknown
|
||||
means the Node Controller, which manages node lifecycle and is responsible for
|
||||
setting the Status of the condition, has not heard from the
|
||||
node recently (currently 40 seconds).
|
||||
A node condition is represented as a JSON object. For example,
the following conditions mean the node is in a healthy state:
|
||||
|
||||
{% highlight json %}
|
||||
{% raw %}
|
||||
"conditions": [
|
||||
{
|
||||
"kind": "Ready",
|
||||
"status": "True",
|
||||
},
|
||||
]
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
If the Status of the Ready condition
|
||||
is Unknown or False for more than five minutes, then all of the Pods on the node are terminated by the Node Controller.
|
||||
|
||||
### Node Capacity
|
||||
|
||||
Describes the resources available on the node: CPUs, memory and the maximum
|
||||
number of pods that can be scheduled onto the node.
|
||||
|
||||
### Node Info
|
||||
|
||||
General information about the node, for instance kernel version, Kubernetes version
|
||||
(kubelet version, kube-proxy version), docker version (if used), OS name.
|
||||
The information is gathered by Kubelet from the node.
|
||||
|
||||
## Node Management
|
||||
|
||||
Unlike [Pods](../user-guide/pods.html) and [Services](../user-guide/services.html), a Node is not inherently
|
||||
created by Kubernetes: it is either taken from cloud providers like Google Compute Engine,
|
||||
or from your pool of physical or virtual machines. What this means is that when
|
||||
Kubernetes creates a node, it is really just creating an object that represents the node in its internal state.
|
||||
After creation, Kubernetes will check whether the node is valid or not.
|
||||
For example, if you try to create a node from the following content:
|
||||
|
||||
{% highlight json %}
|
||||
{% raw %}
|
||||
{
|
||||
"kind": "Node",
|
||||
"apiVersion": "v1",
|
||||
"metadata": {
|
||||
"name": "10.240.79.157",
|
||||
"labels": {
|
||||
"name": "my-first-k8s-node"
|
||||
}
|
||||
}
|
||||
}
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
Kubernetes will create a Node object internally (the representation), and
|
||||
validate the node by health checking based on the `metadata.name` field: we
|
||||
assume `metadata.name` can be resolved. If the node is valid, i.e. all necessary
|
||||
services are running, it is eligible to run a Pod; otherwise, it will be
|
||||
ignored for any cluster activity, until it becomes valid. Note that Kubernetes
|
||||
will keep the object for the invalid node unless it is explicitly deleted by the client, and it will keep
|
||||
checking to see if it becomes valid.
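For instance, assuming the JSON above were saved in a file named `node.json` (an illustrative name), the object could be created and then listed like this:

{% highlight console %}
{% raw %}
$ kubectl create -f ./node.json
$ kubectl get nodes
{% endraw %}
{% endhighlight %}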
|
||||
|
||||
Currently, there are three components that interact with the Kubernetes node interface: Node Controller, Kubelet, and kubectl.
|
||||
|
||||
### Node Controller
|
||||
|
||||
The node controller is a component in the Kubernetes master which manages Node
|
||||
objects. It performs two major functions: cluster-wide node synchronization
|
||||
and single node life-cycle management.
|
||||
|
||||
Node controller has a sync loop that creates/deletes Nodes from Kubernetes
|
||||
based on all matching VM instances listed from the cloud provider. The sync period
|
||||
can be controlled via flag `--node-sync-period`. If a new VM instance
|
||||
gets created, Node Controller creates a representation for it. If an existing
|
||||
instance gets deleted, Node Controller deletes the representation. Note however,
|
||||
that Node Controller is unable to provision the node for you, i.e. it won't install
|
||||
any binary; therefore, to
|
||||
join a node to a Kubernetes cluster, you as an admin need to make sure proper services are
|
||||
running in the node. In the future, we plan to automatically provision some node
|
||||
services.
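As a sketch, the flag is passed to the controller-manager component of the master; the value shown here is illustrative and all other flags are omitted:

{% highlight sh %}
{% raw %}
kube-controller-manager ... --node-sync-period=10s
{% endraw %}
{% endhighlight %}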
|
||||
|
||||
### Self-Registration of Nodes
|
||||
|
||||
When kubelet flag `--register-node` is true (the default), the kubelet will attempt to
|
||||
register itself with the API server. This is the preferred pattern, used by most distros.
|
||||
|
||||
For self-registration, the kubelet is started with the following options:
|
||||
- `--api-servers=` tells the kubelet the location of the apiserver.
|
||||
- `--kubeconfig` tells kubelet where to find credentials to authenticate itself to the apiserver.
|
||||
- `--cloud-provider=` tells the kubelet how to talk to a cloud provider to read metadata about itself.
|
||||
- `--register-node` tells the kubelet to create its own node resource.
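Putting these options together, a self-registering kubelet might be started roughly as follows (a minimal sketch; the API server address, kubeconfig path, and cloud provider values are illustrative assumptions):

{% highlight sh %}
{% raw %}
kubelet \
  --api-servers=https://my-apiserver:6443 \
  --kubeconfig=/var/lib/kubelet/kubeconfig \
  --cloud-provider=gce \
  --register-node=true
{% endraw %}
{% endhighlight %}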
|
||||
|
||||
Currently, any kubelet is authorized to create/modify any node resource, but in practice it only creates/modifies
|
||||
its own. (In the future, we plan to limit authorization to only allow a kubelet to modify its own Node resource.)
|
||||
|
||||
#### Manual Node Administration
|
||||
|
||||
A cluster administrator can create and modify Node objects.
|
||||
|
||||
If the administrator wishes to create node objects manually, set kubelet flag
|
||||
`--register-node=false`.
|
||||
|
||||
The administrator can modify Node resources (regardless of the setting of `--register-node`).
|
||||
Modifications include setting labels on the Node, and marking it unschedulable.
|
||||
|
||||
Labels on nodes can be used in conjunction with node selectors on pods to control scheduling,
|
||||
e.g. to constrain a Pod to only be eligible to run on a subset of the nodes.
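For example, a label can be added to a node like this (the node name and label are illustrative); pods can then select such nodes with a matching `nodeSelector`:

{% highlight console %}
{% raw %}
$ kubectl label nodes 10.1.2.3 disktype=ssd
{% endraw %}
{% endhighlight %}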
|
||||
|
||||
Making a node unschedulable will prevent new pods from being scheduled to that
|
||||
node, but will not affect any existing pods on the node. This is useful as a
|
||||
preparatory step before a node reboot, etc. For example, to mark a node
|
||||
unschedulable, run this command:
|
||||
|
||||
{% highlight sh %}
|
||||
{% raw %}
|
||||
kubectl replace nodes 10.1.2.3 --patch='{"apiVersion": "v1", "spec": {"unschedulable": true}}'
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
Note that pods which are created by a daemonSet controller bypass the Kubernetes scheduler,
|
||||
and do not respect the unschedulable attribute on a node. The assumption is that daemons belong on
|
||||
the machine even if it is being drained of applications in preparation for a reboot.
|
||||
|
||||
### Node capacity
|
||||
|
||||
The capacity of the node (number of cpus and amount of memory) is part of the node resource.
|
||||
Normally, nodes register themselves and report their capacity when creating the node resource. If
|
||||
you are doing [manual node administration](#manual-node-administration), then you need to set node
|
||||
capacity when adding a node.
|
||||
|
||||
The Kubernetes scheduler ensures that there are enough resources for all the pods on a node. It
|
||||
checks that the sum of the limits of the containers on the node is no greater than the node capacity. The check
|
||||
includes all containers started by the kubelet, but not containers started directly by Docker, nor
|
||||
processes not running in containers.
|
||||
|
||||
If you want to explicitly reserve resources for non-Pod processes, you can create a placeholder
|
||||
pod. Use the following template:
|
||||
|
||||
{% highlight yaml %}
|
||||
{% raw %}
|
||||
apiVersion: v1
|
||||
kind: Pod
|
||||
metadata:
|
||||
name: resource-reserver
|
||||
spec:
|
||||
containers:
|
||||
- name: sleep-forever
|
||||
image: gcr.io/google_containers/pause:0.8.0
|
||||
resources:
|
||||
limits:
|
||||
cpu: 100m
|
||||
memory: 100Mi
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
Set the `cpu` and `memory` values to the amount of resources you want to reserve.
|
||||
Place the file in the manifest directory (`--config=DIR` flag of kubelet). Do this
|
||||
on each kubelet where you want to reserve resources.
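For example, assuming the template above were saved as `resource-reserver.yaml` and the kubelet on the node were started with `--config=/etc/kubelet.d/` (both assumptions), placing the file might look like this:

{% highlight console %}
{% raw %}
$ cp resource-reserver.yaml /etc/kubelet.d/
{% endraw %}
{% endhighlight %}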
|
||||
|
||||
|
||||
## API Object
|
||||
|
||||
Node is a top-level resource in the Kubernetes REST API. More details about the
|
||||
API object can be found at: [Node API
|
||||
object](http://kubernetes.io/v1.1/docs/api-reference/v1/definitions.html#_v1_node).
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/admin/node.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,36 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "Kubernetes OpenVSwitch GRE/VxLAN networking"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
# Kubernetes OpenVSwitch GRE/VxLAN networking
|
||||
|
||||
This document describes how OpenVSwitch is used to set up networking between pods across nodes.
|
||||
The tunnel type could be GRE or VxLAN. VxLAN is preferable when large scale isolation needs to be performed within the network.
|
||||
|
||||
![ovs-networking](ovs-networking.png "OVS Networking")
|
||||
|
||||
The vagrant setup in Kubernetes does the following:
|
||||
|
||||
The Docker bridge is replaced with a brctl-generated Linux bridge (kbr0) with a 256-address subnet. Basically, each node gets a 10.244.x.0/24 subnet, and Docker is configured to use that bridge instead of the default docker0 bridge.
|
||||
|
||||
Also, an OVS bridge (obr0) is created and added as a port to the kbr0 bridge. All OVS bridges across all nodes are linked with GRE tunnels, so each node has an outgoing GRE tunnel to every other node. It does not strictly need to be a complete mesh, but the more meshed it is, the better. STP (spanning tree) mode is enabled on the bridges to prevent loops.
|
||||
|
||||
Routing rules enable any 10.244.0.0/16 target to become reachable via the OVS bridge connected with the tunnels.
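To inspect the resulting setup on a node, you could look at the bridges and routes with the standard tools (an illustrative sketch; requires the openvswitch and bridge-utils utilities to be installed):

{% highlight console %}
{% raw %}
$ brctl show kbr0
$ ovs-vsctl show
$ ip route | grep 10.244
{% endraw %}
{% endhighlight %}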
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/admin/ovs-networking.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,174 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "Resource Quotas"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
# Resource Quotas
|
||||
|
||||
When several users or teams share a cluster with a fixed number of nodes,
|
||||
there is a concern that one team could use more than its fair share of resources.
|
||||
|
||||
Resource quotas are a tool for administrators to address this concern. Resource quotas
|
||||
work like this:
|
||||
- Different teams work in different namespaces. Currently this is voluntary, but
|
||||
support for making this mandatory via ACLs is planned.
|
||||
- The administrator creates a Resource Quota for each namespace.
|
||||
- Users put compute resource requests on their pods. The sum of all resource requests across
|
||||
all pods in the same namespace must not exceed any hard resource limit in any Resource Quota
|
||||
document for the namespace. Note that we used to verify Resource Quota by taking the sum of
|
||||
resource limits of the pods, but this was altered to use resource requests. Backwards compatibility
|
||||
for those pods previously created is preserved because pods that only specify a resource limit have
|
||||
their resource requests defaulted to match their defined limits. The user is only charged for the
|
||||
resources they request in the Resource Quota versus their limits because the request is the minimum
|
||||
amount of resource guaranteed by the cluster during scheduling. For more information on overcommit,
|
||||
see [compute-resources](../user-guide/compute-resources.html).
|
||||
- If creating a pod would cause the namespace to exceed any of the limits specified in
|
||||
the Resource Quota for that namespace, then the request will fail with HTTP status
|
||||
code `403 FORBIDDEN`.
|
||||
- If quota is enabled in a namespace and the user does not specify *requests* on the pod for each
|
||||
of the resources for which quota is enabled, then the POST of the pod will fail with HTTP
|
||||
status code `403 FORBIDDEN`. Hint: Use the LimitRange admission controller to force default
|
||||
values of *limits* (then resource *requests* would be equal to *limits* by default, see
|
||||
[admission controller](admission-controllers.html)) before the quota is checked to avoid this problem.
|
||||
|
||||
Examples of policies that could be created using namespaces and quotas are:
|
||||
- In a cluster with a capacity of 32 GiB RAM and 16 cores, let team A use 20 GiB and 10 cores,
|
||||
let B use 10 GiB and 4 cores, and hold 2 GiB and 2 cores in reserve for future allocation.
|
||||
- Limit the "testing" namespace to using 1 core and 1 GiB RAM. Let the "production" namespace
|
||||
use any amount.
|
||||
|
||||
In the case where the total capacity of the cluster is less than the sum of the quotas of the namespaces,
|
||||
there may be contention for resources. This is handled on a first-come-first-served basis.
|
||||
|
||||
Neither contention nor changes to quota will affect already-running pods.
|
||||
|
||||
## Enabling Resource Quota
|
||||
|
||||
Resource Quota support is enabled by default for many Kubernetes distributions. It is
|
||||
enabled when the apiserver `--admission-control=` flag has `ResourceQuota` as
|
||||
one of its arguments.
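For example, the relevant portion of an apiserver invocation might look like this (a sketch; all other flags are omitted, and the LimitRanger entry is just an illustrative companion plugin):

{% highlight sh %}
{% raw %}
kube-apiserver ... --admission-control=LimitRanger,ResourceQuota
{% endraw %}
{% endhighlight %}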
|
||||
|
||||
Resource Quota is enforced in a particular namespace when there is a
|
||||
`ResourceQuota` object in that namespace. There should be at most one
|
||||
`ResourceQuota` object in a namespace.
|
||||
|
||||
## Compute Resource Quota
|
||||
|
||||
The total sum of [compute resources](../user-guide/compute-resources.html) requested by pods
|
||||
in a namespace can be limited. The following compute resource types are supported:
|
||||
|
||||
| ResourceName | Description |
|
||||
| ------------ | ----------- |
|
||||
| cpu | Total cpu requests of containers |
|
||||
| memory | Total memory requests of containers |
|
||||
|
||||
For example, `cpu` quota sums up the `resources.requests.cpu` fields of every
|
||||
container of every pod in the namespace, and enforces a maximum on that sum.
|
||||
|
||||
## Object Count Quota
|
||||
|
||||
The number of objects of a given type can be restricted. The following types
|
||||
are supported:
|
||||
|
||||
| ResourceName | Description |
|
||||
| ------------ | ----------- |
|
||||
| pods | Total number of pods |
|
||||
| services | Total number of services |
|
||||
| replicationcontrollers | Total number of replication controllers |
|
||||
| resourcequotas | Total number of [resource quotas](admission-controllers.html#resourcequota) |
|
||||
| secrets | Total number of secrets |
|
||||
| persistentvolumeclaims | Total number of [persistent volume claims](../user-guide/persistent-volumes.html#persistentvolumeclaims) |
|
||||
|
||||
For example, `pods` quota counts and enforces a maximum on the number of `pods`
|
||||
created in a single namespace.
|
||||
|
||||
You might want to set a pods quota on a namespace
|
||||
to avoid the case where a user creates many small pods and exhausts the cluster's
|
||||
supply of Pod IPs.
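For example, a quota that only caps the number of pods could be created like this (a sketch; the namespace, quota name, and limit of 40 are illustrative):

{% highlight console %}
{% raw %}
$ cat <<EOF > pod-count-quota.json
{
  "apiVersion": "v1",
  "kind": "ResourceQuota",
  "metadata": {
    "name": "pod-count"
  },
  "spec": {
    "hard": {
      "pods": "40"
    }
  }
}
EOF
$ kubectl create -f ./pod-count-quota.json --namespace=myspace
{% endraw %}
{% endhighlight %}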
|
||||
|
||||
## Viewing and Setting Quotas
|
||||
|
||||
Kubectl supports creating, updating, and viewing quotas:
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl namespace myspace
|
||||
$ cat <<EOF > quota.json
|
||||
{
|
||||
"apiVersion": "v1",
|
||||
"kind": "ResourceQuota",
|
||||
"metadata": {
|
||||
"name": "quota",
|
||||
},
|
||||
"spec": {
|
||||
"hard": {
|
||||
"memory": "1Gi",
|
||||
"cpu": "20",
|
||||
"pods": "10",
|
||||
"services": "5",
|
||||
"replicationcontrollers":"20",
|
||||
"resourcequotas":"1",
|
||||
}
|
||||
}
|
||||
}
|
||||
EOF
|
||||
$ kubectl create -f ./quota.json
|
||||
$ kubectl get quota
|
||||
NAME
|
||||
quota
|
||||
$ kubectl describe quota quota
|
||||
Name: quota
|
||||
Resource Used Hard
|
||||
-------- ---- ----
|
||||
cpu 0m 20
|
||||
memory 0 1Gi
|
||||
pods 5 10
|
||||
replicationcontrollers 5 20
|
||||
resourcequotas 1 1
|
||||
services 3 5
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
## Quota and Cluster Capacity
|
||||
|
||||
Resource Quota objects are independent of the Cluster Capacity. They are
|
||||
expressed in absolute units. So, if you add nodes to your cluster, this does *not*
|
||||
automatically give each namespace the ability to consume more resources.
|
||||
|
||||
Sometimes more complex policies may be desired, such as:
|
||||
- proportionally divide total cluster resources among several teams.
|
||||
- allow each tenant to grow resource usage as needed, but have a generous
|
||||
limit to prevent accidental resource exhaustion.
|
||||
- detect demand from one namespace, add nodes, and increase quota.
|
||||
|
||||
Such policies could be implemented using ResourceQuota as a building-block, by
|
||||
writing a 'controller' which watches the quota usage and adjusts the quota
|
||||
hard limits of each namespace according to other signals.
|
||||
|
||||
Note that resource quota divides up aggregate cluster resources, but it creates no
|
||||
restrictions around nodes: pods from several namespaces may run on the same node.
|
||||
|
||||
## Example
|
||||
|
||||
See a [detailed example for how to use resource quota](resourcequota/).
|
||||
|
||||
## Read More
|
||||
|
||||
See [ResourceQuota design doc](../design/admission_control_resource_quota.html) for more information.
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/admin/resource-quota.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,197 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "Resource Quota"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
Resource Quota
|
||||
========================================
|
||||
This example demonstrates how [resource quota](../../admin/admission-controllers.html#resourcequota) and
|
||||
[LimitRanger](../../admin/admission-controllers.html#limitranger) can be applied to a Kubernetes namespace.
|
||||
See [ResourceQuota design doc](../../design/admission_control_resource_quota.html) for more information.
|
||||
|
||||
This example assumes you have a functional Kubernetes setup.
|
||||
|
||||
Step 1: Create a namespace
|
||||
-----------------------------------------
|
||||
This example will work in a custom namespace to demonstrate the concepts involved.
|
||||
|
||||
Let's create a new namespace called quota-example:
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl create -f docs/admin/resourcequota/namespace.yaml
|
||||
namespace "quota-example" created
|
||||
$ kubectl get namespaces
|
||||
NAME LABELS STATUS AGE
|
||||
default <none> Active 2m
|
||||
quota-example <none> Active 39s
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
Step 2: Apply a quota to the namespace
|
||||
-----------------------------------------
|
||||
By default, a pod will run with unbounded CPU and memory requests/limits. This means that any pod in the
|
||||
system will be able to consume as much CPU and memory as is available on the node that executes the pod.
|
||||
|
||||
Users may want to restrict how much of the cluster resources a given namespace may consume
|
||||
across all of its pods in order to manage cluster usage. To do this, a user applies a quota to
|
||||
a namespace. A quota lets the user set hard limits on the total amount of node resources (cpu, memory)
|
||||
and API resources (pods, services, etc.) that a namespace may consume. In terms of resources, Kubernetes
|
||||
checks the total resource *requests*, not resource *limits* of all containers/pods in the namespace.
|
||||
|
||||
Let's create a simple quota in our namespace:
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl create -f docs/admin/resourcequota/quota.yaml --namespace=quota-example
|
||||
resourcequota "quota" created
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
Once your quota is applied to a namespace, the system will restrict any creation of content
|
||||
in the namespace until the quota usage has been calculated. This should happen quickly.
|
||||
|
||||
You can describe your current quota usage to see what resources are being consumed in your
|
||||
namespace.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl describe quota quota --namespace=quota-example
|
||||
Name: quota
|
||||
Namespace: quota-example
|
||||
Resource Used Hard
|
||||
-------- ---- ----
|
||||
cpu 0 20
|
||||
memory 0 1Gi
|
||||
persistentvolumeclaims 0 10
|
||||
pods 0 10
|
||||
replicationcontrollers 0 20
|
||||
resourcequotas 1 1
|
||||
secrets 1 10
|
||||
services 0 5
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
Step 3: Applying default resource requests and limits
|
||||
-----------------------------------------
|
||||
Pod authors rarely specify resource requests and limits for their pods.
|
||||
|
||||
Since we applied a quota to our project, let's see what happens when an end-user creates a pod that has unbounded
|
||||
cpu and memory by creating an nginx container.
|
||||
|
||||
To demonstrate, let's create a replication controller that runs nginx:
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl run nginx --image=nginx --replicas=1 --namespace=quota-example
|
||||
replicationcontroller "nginx" created
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
Now let's look at the pods that were created.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl get pods --namespace=quota-example
|
||||
NAME READY STATUS RESTARTS AGE
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
What happened? I have no pods! Let's describe the replication controller to get a view of what is happening.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
kubectl describe rc nginx --namespace=quota-example
|
||||
Name: nginx
|
||||
Namespace: quota-example
|
||||
Image(s): nginx
|
||||
Selector: run=nginx
|
||||
Labels: run=nginx
|
||||
Replicas: 0 current / 1 desired
|
||||
Pods Status: 0 Running / 0 Waiting / 0 Succeeded / 0 Failed
|
||||
No volumes.
|
||||
Events:
|
||||
FirstSeen LastSeen Count From SubobjectPath Reason Message
|
||||
42s 11s 3 {replication-controller } FailedCreate Error creating: Pod "nginx-" is forbidden: Must make a non-zero request for memory since it is tracked by quota.
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
The Kubernetes API server is rejecting the replication controller's requests to create a pod because our pods
|
||||
do not specify any memory usage *request*.
|
||||
|
||||
So let's set some default values for the amount of cpu and memory a pod can consume:
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl create -f docs/admin/resourcequota/limits.yaml --namespace=quota-example
|
||||
limitrange "limits" created
|
||||
$ kubectl describe limits limits --namespace=quota-example
|
||||
Name: limits
|
||||
Namespace: quota-example
|
||||
Type Resource Min Max Request Limit Limit/Request
|
||||
---- -------- --- --- ------- ----- -------------
|
||||
Container memory - - 256Mi 512Mi -
|
||||
Container cpu - - 100m 200m -
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
Now any time a pod is created in this namespace, if it has not specified any resource request/limit, the default
|
||||
amount of cpu and memory per container will be applied, and the request will be used as part of admission control.
|
||||
|
||||
Now that we have applied default resource *requests* for our namespace, our replication controller should be able to
|
||||
create its pods.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl get pods --namespace=quota-example
|
||||
NAME READY STATUS RESTARTS AGE
|
||||
nginx-fca65 1/1 Running 0 1m
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
And if we print out our quota usage in the namespace:
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl describe quota quota --namespace=quota-example
|
||||
Name: quota
|
||||
Namespace: quota-example
|
||||
Resource Used Hard
|
||||
-------- ---- ----
|
||||
cpu 100m 20
|
||||
memory 256Mi 1Gi
|
||||
persistentvolumeclaims 0 10
|
||||
pods 1 10
|
||||
replicationcontrollers 1 20
|
||||
resourcequotas 1 1
|
||||
secrets 1 10
|
||||
services 0 5
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
You can now see the pod that was created is consuming explicit amounts of resources (specified by resource *request*),
|
||||
and the usage is being tracked by the Kubernetes system properly.
|
||||
|
||||
Summary
|
||||
----------------------------
|
||||
Actions that consume node resources for cpu and memory can be subject to hard quota limits defined
|
||||
by the namespace quota. The resource consumption is measured by resource *request* in pod specification.
|
||||
|
||||
Any action that consumes those resources can be tweaked, or can pick up namespace level defaults to
|
||||
meet your end goal.
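If you want to clean up after the example, deleting the namespace removes everything created inside it:

{% highlight console %}
{% raw %}
$ kubectl delete namespace quota-example
{% endraw %}
{% endhighlight %}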
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/admin/resourcequota/README.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,13 @@
|
|||
apiVersion: v1
|
||||
kind: LimitRange
|
||||
metadata:
|
||||
name: limits
|
||||
spec:
|
||||
limits:
|
||||
- default:
|
||||
cpu: 200m
|
||||
memory: 512Mi
|
||||
defaultRequest:
|
||||
cpu: 100m
|
||||
memory: 256Mi
|
||||
type: Container
|
|
@ -0,0 +1,4 @@
|
|||
apiVersion: v1
|
||||
kind: Namespace
|
||||
metadata:
|
||||
name: quota-example
|
|
@ -0,0 +1,14 @@
|
|||
apiVersion: v1
|
||||
kind: ResourceQuota
|
||||
metadata:
|
||||
name: quota
|
||||
spec:
|
||||
hard:
|
||||
cpu: "20"
|
||||
memory: 1Gi
|
||||
persistentvolumeclaims: "10"
|
||||
pods: "10"
|
||||
replicationcontrollers: "20"
|
||||
resourcequotas: "1"
|
||||
secrets: "10"
|
||||
services: "5"
|
|
@ -0,0 +1,129 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "Using Salt to configure Kubernetes"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
# Using Salt to configure Kubernetes
|
||||
|
||||
The Kubernetes cluster can be configured using Salt.
|
||||
|
||||
The Salt scripts are shared across multiple hosting providers, and depending on where you host your Kubernetes cluster you may be using different operating systems and different networking configurations. It's therefore important to understand some background information before making Salt changes, so that your modifications do not break Kubernetes hosting for the other environments.
|
||||
|
||||
## Salt cluster setup
|
||||
|
||||
The **salt-master** service runs on the kubernetes-master [(except on the default GCE setup)](#standalone-salt-configuration-on-gce).
|
||||
|
||||
The **salt-minion** service runs on the kubernetes-master and each kubernetes-node in the cluster.
|
||||
|
||||
Each salt-minion service is configured to interact with the **salt-master** service hosted on the kubernetes-master via the **master.conf** file [(except on GCE)](#standalone-salt-configuration-on-gce).
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
[root@kubernetes-master] $ cat /etc/salt/minion.d/master.conf
|
||||
master: kubernetes-master
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
The salt-master is contacted by each salt-minion and depending upon the machine information presented, the salt-master will provision the machine as either a kubernetes-master or kubernetes-node with all the required capabilities needed to run Kubernetes.
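To check which minions have registered with the salt-master, you can run a standard Salt command on the master (not specific to Kubernetes):

{% highlight console %}
{% raw %}
[root@kubernetes-master] $ salt-key -L
{% endraw %}
{% endhighlight %}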
|
||||
|
||||
If you are running the Vagrant based environment, the **salt-api** service is running on the kubernetes-master. It is configured to enable the vagrant user to introspect the salt cluster in order to find out about machines in the Vagrant environment via a REST API.
|
||||
|
||||
## Standalone Salt Configuration on GCE
|
||||
|
||||
On GCE, the master and nodes are all configured as [standalone minions](http://docs.saltstack.com/en/latest/topics/tutorials/standalone_minion.html). The configuration for each VM is derived from the VM's [instance metadata](https://cloud.google.com/compute/docs/metadata) and then stored in Salt grains (`/etc/salt/minion.d/grains.conf`) and pillars (`/srv/salt-overlay/pillar/cluster-params.sls`) that local Salt uses to enforce state.
|
||||
|
||||
All remaining sections that refer to master/minion setups should be ignored for GCE. One fallout of the GCE setup is that the Salt mine doesn't exist - there is no sharing of configuration amongst nodes.
|
||||
|
||||
## Salt security
|
||||
|
||||
*(Not applicable on default GCE setup.)*
|
||||
|
||||
Security is not enabled on the salt-master, and the salt-master is configured to auto-accept incoming requests from minions. It is not recommended to use this security configuration in production environments without deeper study. (In some environments this isn't as bad as it might sound if the salt master port isn't externally accessible and you trust everyone on your network.)
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
[root@kubernetes-master] $ cat /etc/salt/master.d/auto-accept.conf
|
||||
open_mode: True
|
||||
auto_accept: True
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
## Salt minion configuration
|
||||
|
||||
Each minion in the salt cluster has an associated configuration that instructs the salt-master how to provision the required resources on the machine.
|
||||
|
||||
An example file is presented below using the Vagrant based environment.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
[root@kubernetes-master] $ cat /etc/salt/minion.d/grains.conf
|
||||
grains:
|
||||
etcd_servers: $MASTER_IP
|
||||
cloud_provider: vagrant
|
||||
roles:
|
||||
- kubernetes-master
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
Each hosting environment has a slightly different grains.conf file that is used to build conditional logic where required in the Salt files.
|
||||
|
||||
The following enumerates the set of defined key/value pairs that are supported today. If you add new ones, please make sure to update this list.
|
||||
|
||||
Key | Value
|
||||
------------- | -------------
|
||||
`api_servers` | (Optional) The IP address / host name where a kubelet can get read-only access to kube-apiserver
|
||||
`cbr-cidr` | (Optional) The minion IP address range used for the docker container bridge.
|
||||
`cloud` | (Optional) Which IaaS platform is used to host Kubernetes, *gce*, *azure*, *aws*, *vagrant*
|
||||
`etcd_servers` | (Optional) Comma-delimited list of IP addresses the kube-apiserver and kubelet use to reach etcd. Uses the IP of the first machine in the kubernetes_master role, or 127.0.0.1 on GCE.
|
||||
`hostnamef` | (Optional) The full host name of the machine, i.e. uname -n
|
||||
`node_ip` | (Optional) The IP address to use to address this node
|
||||
`hostname_override` | (Optional) Mapped to the kubelet hostname-override
|
||||
`network_mode` | (Optional) Networking model to use among nodes: *openvswitch*
|
||||
`networkInterfaceName` | (Optional) Networking interface to use to bind addresses, default value *eth0*
|
||||
`publicAddressOverride` | (Optional) The IP address the kube-apiserver should use to bind against for external read-only access
|
||||
`roles` | (Required) 1. `kubernetes-master` means this machine is the master in the Kubernetes cluster. 2. `kubernetes-pool` means this machine is a kubernetes-node. Depending on the role, the Salt scripts will provision different resources on the machine.
|
||||
|
||||
These keys may be leveraged by the Salt sls files to branch behavior.
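To verify which grains a given machine actually presents, you can query them from the salt-master with a standard Salt command (the `'*'` target matches all minions):

{% highlight console %}
{% raw %}
[root@kubernetes-master] $ salt '*' grains.items
{% endraw %}
{% endhighlight %}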
|
||||
|
||||
In addition, a cluster may be running a Debian-based operating system or a Red Hat-based operating system (CentOS, Fedora, RHEL, etc.). As a result, it's sometimes important to distinguish behavior based on the operating system using if branches like the following.
|
||||
|
||||
{% highlight jinja %}
|
||||
{% raw %}
|
||||
{% if grains['os_family'] == 'RedHat' %}
|
||||
// something specific to a RedHat environment (Centos, Fedora, RHEL) where you may use yum, systemd, etc.
|
||||
{% else %}
|
||||
// something specific to Debian environment (apt-get, initd)
|
||||
{% endif %}
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. When configuring default arguments for processes, it's best to avoid the use of EnvironmentFiles (Systemd in Red Hat environments) or init.d files (Debian distributions) to hold default values that should be common across operating system environments. This helps keep our Salt template files easy to understand for editors who may not be familiar with the particulars of each distribution.
|
||||
|
||||
## Future enhancements (Networking)
|
||||
|
||||
Per-pod IP configuration is provider-specific, so when making networking changes, it's important to sandbox these changes, as not all providers use the same mechanisms (iptables, openvswitch, etc.).
|
||||
|
||||
We should define a grains.conf key that captures more specifically what network configuration environment is being used to avoid future confusion across providers.
|
||||
|
||||
## Further reading
|
||||
|
||||
The [cluster/saltbase](http://releases.k8s.io/release-1.1/cluster/saltbase/) tree has more details on the current SaltStack configuration.
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/admin/salt.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,120 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "Cluster Admin Guide to Service Accounts"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
# Cluster Admin Guide to Service Accounts
|
||||
|
||||
*This is a Cluster Administrator guide to service accounts. It assumes knowledge of
|
||||
the [User Guide to Service Accounts](../user-guide/service-accounts.html).*
|
||||
|
||||
*Support for authorization and user accounts is planned but incomplete. Sometimes
|
||||
incomplete features are referred to in order to better describe service accounts.*
|
||||
|
||||
## User accounts vs service accounts
|
||||
|
||||
Kubernetes distinguishes between the concept of a user account and a service account
|
||||
for a number of reasons:
|
||||
- User accounts are for humans. Service accounts are for processes, which
|
||||
run in pods.
|
||||
- User accounts are intended to be global. Names must be unique across all
|
||||
namespaces of a cluster (a future user resource will not be namespaced).
|
||||
Service accounts are namespaced.
|
||||
- Typically, a cluster's User accounts might be synced from a corporate
|
||||
database, where new user account creation requires special privileges and
|
||||
is tied to complex business processes. Service account creation is intended
|
||||
to be more lightweight, allowing cluster users to create service accounts for
|
||||
specific tasks (i.e. principle of least privilege).
|
||||
- Auditing considerations for humans and service accounts may differ.
|
||||
- A config bundle for a complex system may include definition of various service
|
||||
accounts for components of that system. Because service accounts can be created
|
||||
ad-hoc and have namespaced names, such config is portable.
|
||||
|
||||
## Service account automation
|
||||
|
||||
Three separate components cooperate to implement the automation around service accounts:
|
||||
- A Service account admission controller
|
||||
- A Token controller
|
||||
- A Service account controller
|
||||
|
||||
### Service Account Admission Controller
|
||||
|
||||
The modification of pods is implemented via a plugin
|
||||
called an [Admission Controller](admission-controllers.html). It is part of the apiserver.
|
||||
It acts synchronously to modify pods as they are created or updated. When this plugin is active
|
||||
(and it is by default on most distributions), then it does the following when a pod is created or modified:
|
||||
1. If the pod does not have a `ServiceAccount` set, it sets the `ServiceAccount` to `default`.
|
||||
2. It ensures that the `ServiceAccount` referenced by the pod exists, and otherwise rejects it.
|
||||
3. If the pod does not contain any `ImagePullSecrets`, then `ImagePullSecrets` of the
|
||||
`ServiceAccount` are added to the pod.
|
||||
4. It adds a `volume` to the pod which contains a token for API access.
|
||||
5. It adds a `volumeSource` to each container of the pod mounted at `/var/run/secrets/kubernetes.io/serviceaccount`.
|
||||
|
||||
### Token Controller
|
||||
|
||||
TokenController runs as part of controller-manager. It acts asynchronously. It:
|
||||
- observes serviceAccount creation and creates a corresponding Secret to allow API access.
|
||||
- observes serviceAccount deletion and deletes all corresponding ServiceAccountToken Secrets
|
||||
- observes secret addition, and ensures the referenced ServiceAccount exists, and adds a token to the secret if needed
|
||||
- observes secret deletion and removes a reference from the corresponding ServiceAccount if needed
|
||||
|
||||
#### To create additional API tokens
|
||||
|
||||
A controller loop ensures a secret with an API token exists for each service
|
||||
account. To create additional API tokens for a service account, create a secret
|
||||
of type `ServiceAccountToken` with an annotation referencing the service
|
||||
account, and the controller will update it with a generated token:
|
||||
|
||||
{% highlight json %}
|
||||
{% raw %}
|
||||
secret.json:
|
||||
{
|
||||
"kind": "Secret",
|
||||
"apiVersion": "v1",
|
||||
"metadata": {
|
||||
"name": "mysecretname",
|
||||
"annotations": {
|
||||
"kubernetes.io/service-account.name": "myserviceaccount"
|
||||
}
|
||||
},
|
||||
"type": "kubernetes.io/service-account-token"
|
||||
}
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
{% highlight sh %}
|
||||
{% raw %}
|
||||
kubectl create -f ./secret.json
|
||||
kubectl describe secret mysecretname
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
#### To delete/invalidate a service account token
|
||||
|
||||
{% highlight sh %}
|
||||
{% raw %}
|
||||
kubectl delete secret mysecretname
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
### Service Account Controller
|
||||
|
||||
Service Account Controller manages ServiceAccounts inside namespaces, and ensures
|
||||
a ServiceAccount named "default" exists in every active namespace.
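You can verify this by listing the service accounts in a namespace; each active namespace should contain at least the automatically created `default` account:

{% highlight console %}
{% raw %}
$ kubectl get serviceaccounts --namespace=default
{% endraw %}
{% endhighlight %}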
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/admin/service-accounts-admin.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,165 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "Static pods (deprecated)"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
# Static pods (deprecated)
|
||||
|
||||
**Static pods are deprecated and may be removed in any future Kubernetes release!**
|
||||
|
||||
*Static pods* are managed directly by the kubelet daemon on a specific node, without the API server observing them. They do not have an associated replication controller; the kubelet daemon itself watches them and restarts them when they crash. There is no health check though. Static pods are always bound to one kubelet daemon and always run on the same node with it.
|
||||
|
||||
The kubelet automatically creates a so-called *mirror pod* on the Kubernetes API server for each static pod, so the pods are visible there, but they cannot be controlled from the API server.
|
||||
|
||||
## Static pod creation
|
||||
|
||||
Static pods can be created in two ways: either by using configuration file(s) or via HTTP.
|
||||
|
||||
### Configuration files
|
||||
|
||||
The configuration files are just standard pod definitions in JSON or YAML format placed in a specific directory. Use `kubelet --config=<the directory>` to start the kubelet daemon, which periodically scans the directory and creates/deletes static pods as YAML/JSON files appear/disappear there.
|
||||
|
||||
For example, this is how to start a simple web server as a static pod:
|
||||
|
||||
1. Choose a node where we want to run the static pod. In this example, it's `my-minion1`.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
[joe@host ~] $ ssh my-minion1
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
2. Choose a directory, say `/etc/kubelet.d`, and place a web server pod definition there, e.g. `/etc/kubelet.d/static-web.yaml`:
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
[root@my-minion1 ~] $ mkdir /etc/kubelet.d/
|
||||
[root@my-minion1 ~] $ cat <<EOF >/etc/kubelet.d/static-web.yaml
|
||||
apiVersion: v1
|
||||
kind: Pod
|
||||
metadata:
|
||||
name: static-web
|
||||
labels:
|
||||
role: myrole
|
||||
spec:
|
||||
containers:
|
||||
- name: web
|
||||
image: nginx
|
||||
ports:
|
||||
- name: web
|
||||
containerPort: 80
|
||||
protocol: TCP
|
||||
EOF
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
3. Configure your kubelet daemon on the node to use this directory by running it with the `--config=/etc/kubelet.d/` argument. On Fedora 21 with Kubernetes 0.17, edit `/etc/kubernetes/kubelet` to include this line:
|
||||
|
||||
```
|
||||
{% raw %}
|
||||
KUBELET_ARGS="--cluster-dns=10.254.0.10 --cluster-domain=kube.local --config=/etc/kubelet.d/"
|
||||
{% endraw %}
|
||||
```
|
||||
|
||||
Instructions for other distributions or Kubernetes installations may vary.
|
||||
|
||||
4. Restart the kubelet. On Fedora 21, this is:
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
[root@my-minion1 ~] $ systemctl restart kubelet
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
## Pods created via HTTP
|
||||
|
||||
Kubelet periodically downloads a file specified by `--manifest-url=<URL>` argument and interprets it as a json/yaml file with a pod definition. It works the same as `--config=<directory>`, i.e. it's reloaded every now and then and changes are applied to running static pods (see below).
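A minimal sketch of this mode, assuming the pod definition above is served at an illustrative URL:

{% highlight sh %}
{% raw %}
kubelet --manifest-url=http://my-config-server/static-web.yaml
{% endraw %}
{% endhighlight %}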
|
||||
|
||||
## Behavior of static pods
|
||||
|
||||
When the kubelet starts, it automatically starts all pods defined in the directory specified by the `--config=` or `--manifest-url=` argument, i.e. our static-web pod. (It may take some time to pull the nginx image, be patient…):
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
[joe@my-minion1 ~] $ docker ps
|
||||
CONTAINER ID IMAGE COMMAND CREATED STATUS NAMES
|
||||
f6d05272b57e nginx:latest "nginx" 8 minutes ago Up 8 minutes k8s_web.6f802af4_static-web-fk-minion1_default_67e24ed9466ba55986d120c867395f3c_378e5f3c
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
If we look at our Kubernetes API server (running on host `my-master`), we see that a new mirror-pod was created there too:
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
[joe@host ~] $ ssh my-master
|
||||
[joe@my-master ~] $ kubectl get pods
|
||||
POD IP CONTAINER(S) IMAGE(S) HOST LABELS STATUS CREATED MESSAGE
|
||||
static-web-my-minion1 172.17.0.3 my-minion1/192.168.100.71 role=myrole Running 11 minutes
|
||||
web nginx Running 11 minutes
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
Labels from the static pod are propagated into the mirror-pod and can be used as usual for filtering.
|
||||
|
||||
Notice that we cannot delete the pod via the API server (e.g. with the [`kubectl`](../user-guide/kubectl/kubectl.html) command); the kubelet simply won't remove it.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
[joe@my-master ~] $ kubectl delete pod static-web-my-minion1
|
||||
pods/static-web-my-minion1
|
||||
[joe@my-master ~] $ kubectl get pods
|
||||
POD IP CONTAINER(S) IMAGE(S) HOST ...
|
||||
static-web-my-minion1 172.17.0.3 my-minion1/192.168.100.71 ...
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
Back on our `my-minion1` host, we can try to stop the container manually and see that the kubelet automatically restarts it after a while:
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
[joe@host ~] $ ssh my-minion1
|
||||
[joe@my-minion1 ~] $ docker stop f6d05272b57e
|
||||
[joe@my-minion1 ~] $ sleep 20
|
||||
[joe@my-minion1 ~] $ docker ps
|
||||
CONTAINER ID IMAGE COMMAND CREATED ...
|
||||
5b920cbaf8b1 nginx:latest "nginx -g 'daemon of 2 seconds ago ...
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
## Dynamic addition and removal of static pods
|
||||
|
||||
The running kubelet periodically scans the configured directory (`/etc/kubelet.d` in our example) for changes and adds/removes pods as files appear/disappear in this directory.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
[joe@my-minion1 ~] $ mv /etc/kubelet.d/static-web.yaml /tmp
|
||||
[joe@my-minion1 ~] $ sleep 20
|
||||
[joe@my-minion1 ~] $ docker ps
|
||||
// no nginx container is running
|
||||
[joe@my-minion1 ~] $ mv /tmp/static-web.yaml /etc/kubelet.d/
|
||||
[joe@my-minion1 ~] $ sleep 20
|
||||
[joe@my-minion1 ~] $ docker ps
|
||||
CONTAINER ID IMAGE COMMAND CREATED ...
|
||||
e7a62e3427f1 nginx:latest "nginx -g 'daemon of 27 seconds ago
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/admin/static-pods.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,147 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "The Kubernetes API"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
# The Kubernetes API
|
||||
|
||||
Primary system and API concepts are documented in the [User guide](user-guide/README.html).
|
||||
|
||||
Overall API conventions are described in the [API conventions doc](devel/api-conventions.html).
|
||||
|
||||
Complete API details are documented via [Swagger](http://swagger.io/). The Kubernetes apiserver (aka "master") exports an API that can be used to retrieve the [Swagger spec](https://github.com/swagger-api/swagger-spec/tree/master/schemas/v1.2) for the Kubernetes API, by default at `/swaggerapi`, and a UI you can use to browse the API documentation at `/swagger-ui`. We also periodically update a [statically generated UI](http://kubernetes.io/third_party/swagger-ui/).
|
||||
|
||||
Remote access to the API is discussed in the [access doc](admin/accessing-the-api.html).
|
||||
|
||||
The Kubernetes API also serves as the foundation for the declarative configuration schema for the system. The [Kubectl](user-guide/kubectl/kubectl.html) command-line tool can be used to create, update, delete, and get API objects.
|
||||
|
||||
Kubernetes also stores its serialized state (currently in [etcd](https://coreos.com/docs/distributed-configuration/getting-started-with-etcd/)) in terms of the API resources.
|
||||
|
||||
Kubernetes itself is decomposed into multiple components, which interact through its API.
|
||||
|
||||
## API changes
|
||||
|
||||
In our experience, any system that is successful needs to grow and change as new use cases emerge or existing ones change. Therefore, we expect the Kubernetes API to continuously change and grow. However, we intend not to break compatibility with existing clients for an extended period of time. In general, new API resources and new resource fields can be expected to be added frequently. Elimination of resources or fields will require following a deprecation process. The precise deprecation policy for eliminating features is TBD, but once we reach our 1.0 milestone, there will be a specific policy.
|
||||
|
||||
What constitutes a compatible change and how to change the API are detailed by the [API change document](devel/api_changes.html).
|
||||
|
||||
## API versioning
|
||||
|
||||
To make it easier to eliminate fields or restructure resource representations, Kubernetes supports
|
||||
multiple API versions, each at a different API path, such as `/api/v1` or
|
||||
`/apis/extensions/v1beta1`.
|
||||
|
||||
We chose to version at the API level rather than at the resource or field level to ensure that the API presents a clear, consistent view of system resources and behavior, and to enable controlling access to end-of-lifed and/or experimental APIs.
|
||||
|
||||
Note that API versioning and Software versioning are only indirectly related. The [API and release
|
||||
versioning proposal](design/versioning.html) describes the relationship between API versioning and
|
||||
software versioning.
|
||||
|
||||
|
||||
Different API versions imply different levels of stability and support. The criteria for each level are described
|
||||
in more detail in the [API Changes documentation](devel/api_changes.html#alpha-beta-and-stable-versions). They are summarized here:
|
||||
|
||||
- Alpha level:
|
||||
- The version names contain `alpha` (e.g. `v1alpha1`).
|
||||
- May be buggy. Enabling the feature may expose bugs. Disabled by default.
|
||||
- Support for feature may be dropped at any time without notice.
|
||||
- The API may change in incompatible ways in a later software release without notice.
|
||||
- Recommended for use only in short-lived testing clusters, due to increased risk of bugs and lack of long-term support.
|
||||
- Beta level:
|
||||
- The version names contain `beta` (e.g. `v2beta3`).
|
||||
- Code is well tested. Enabling the feature is considered safe. Enabled by default.
|
||||
- Support for the overall feature will not be dropped, though details may change.
|
||||
- The schema and/or semantics of objects may change in incompatible ways in a subsequent beta or stable release. When this happens,
|
||||
we will provide instructions for migrating to the next version. This may require deleting, editing, and re-creating
|
||||
API objects. The editing process may require some thought. This may require downtime for applications that rely on the feature.
|
||||
- Recommended for only non-business-critical uses because of potential for incompatible changes in subsequent releases. If you have
|
||||
multiple clusters which can be upgraded independently, you may be able to relax this restriction.
|
||||
- **Please do try our beta features and give feedback on them! Once they exit beta, it may not be practical for us to make more changes.**
|
||||
- Stable level:
|
||||
- The version name is `vX` where `X` is an integer.
|
||||
- Stable versions of features will appear in released software for many subsequent versions.
|
||||
|
||||
## API groups
|
||||
|
||||
To make it easier to extend the Kubernetes API, we are in the process of implementing [*API
|
||||
groups*](proposals/api-group.html). These are simply different interfaces to read and/or modify the
|
||||
same underlying resources. The API group is specified in a REST path and in the `apiVersion` field
|
||||
of a serialized object.
|
||||
|
||||
Currently there are two API groups in use:
|
||||
|
||||
1. the "core" group, which is at REST path `/api/v1` and is not specified as part of the `apiVersion` field, e.g.
|
||||
`apiVersion: v1`.
|
||||
1. the "extensions" group, which is at REST path `/apis/extensions/$VERSION`, and which uses
|
||||
`apiVersion: extensions/$VERSION` (e.g. currently `apiVersion: extensions/v1beta1`).
|
||||
|
||||
In the future we expect that there will be more API groups, all at REST path `/apis/$API_GROUP` and using `apiVersion: $API_GROUP/$VERSION`. We expect that there will be a way for [third parties to create their own API groups](design/extending-api.html), and a mechanism to avoid naming collisions.
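
For example, objects from the two groups are distinguished by their REST paths and `apiVersion` fields. Only the identifying fields are sketched below (the kinds are illustrative examples of each group, not complete manifests):

```
{% raw %}
# Core ("legacy") group: served under /api/v1; no group name in apiVersion.
apiVersion: v1
kind: Service
---
# Extensions group: served under /apis/extensions/v1beta1;
# the group name prefixes the version in apiVersion.
apiVersion: extensions/v1beta1
kind: Ingress
{% endraw %}
```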
|
||||
|
||||
## Enabling resources in the extensions group
|
||||
|
||||
Jobs, Ingress, and HorizontalPodAutoscalers are enabled by default. Other extensions resources can be enabled by setting `--runtime-config` on the apiserver. `--runtime-config` accepts comma-separated values. For example, to enable deployments and disable jobs, set `--runtime-config=extensions/v1beta1/deployments=true,extensions/v1beta1/jobs=false`.
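
For instance, an apiserver invocation applying that setting might look like the following sketch (all other required apiserver flags are omitted here for brevity):

{% highlight console %}
{% raw %}
$ kube-apiserver --runtime-config=extensions/v1beta1/deployments=true,extensions/v1beta1/jobs=false
{% endraw %}
{% endhighlight %}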
|
||||
|
||||
## v1beta1, v1beta2, and v1beta3 are deprecated; please move to v1 ASAP
|
||||
|
||||
As of June 4, 2015, the Kubernetes v1 API has been enabled by default. The v1beta1 and v1beta2 APIs were deleted on June 1, 2015. v1beta3 is planned to be deleted on July 6, 2015.
|
||||
|
||||
### v1 conversion tips (from v1beta3)
|
||||
|
||||
We're working to convert all documentation and examples to v1. A simple [API conversion tool](admin/cluster-management.html#switching-your-config-files-to-a-new-api-version) has been written to simplify the translation process. Use `kubectl create --validate` to validate your JSON or YAML against our Swagger spec.
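
For example, the following validates a config file against the schema as part of creating it (the file name here is just a placeholder):

{% highlight console %}
{% raw %}
$ kubectl create --validate -f ./my-app.yaml
{% endraw %}
{% endhighlight %}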
|
||||
|
||||
Changes to services are the most significant difference between v1beta3 and v1.
|
||||
|
||||
* The `service.spec.portalIP` property is renamed to `service.spec.clusterIP`.
|
||||
* The `service.spec.createExternalLoadBalancer` property is removed. Specify `service.spec.type: "LoadBalancer"` to create an external load balancer instead.
|
||||
* The `service.spec.publicIPs` property is deprecated and now called `service.spec.deprecatedPublicIPs`. This property will be removed entirely when v1beta3 is removed. The vast majority of users of this field were using it to expose services on ports on the node. Those users should specify `service.spec.type: "NodePort"` instead. Read [External Services](user-guide/services.html#external-services) for more info. If this is not sufficient for your use case, please file an issue or contact @thockin.
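
For example, a v1 Service that replaces the old `publicIPs` pattern by exposing a node port might look like the following sketch (the name, selector, and ports are illustrative):

```
{% raw %}
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  type: "NodePort"
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080
{% endraw %}
```

Specifying `type: "LoadBalancer"` instead requests an external load balancer, replacing `createExternalLoadBalancer`.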
|
||||
|
||||
Some other differences between v1beta3 and v1:
|
||||
|
||||
* The `pod.spec.containers[*].privileged` and `pod.spec.containers[*].capabilities` properties are now nested under the `pod.spec.containers[*].securityContext` property. See [Security Contexts](user-guide/security-context.html).
|
||||
* The `pod.spec.host` property is renamed to `pod.spec.nodeName`.
|
||||
* The `endpoints.subsets[*].addresses.IP` property is renamed to `endpoints.subsets[*].addresses.ip`.
|
||||
* The `pod.status.containerStatuses[*].state.termination` and `pod.status.containerStatuses[*].lastState.termination` properties are renamed to `pod.status.containerStatuses[*].state.terminated` and `pod.status.containerStatuses[*].lastState.terminated` respectively.
|
||||
* The `pod.status.Condition` property is renamed to `pod.status.conditions`.
|
||||
* The `status.details.id` property is renamed to `status.details.name`.
|
||||
|
||||
### v1beta3 conversion tips (from v1beta1/2)
|
||||
|
||||
Some important differences between v1beta1/2 and v1beta3:
|
||||
|
||||
* The resource `id` is now called `name`.
|
||||
* `name`, `labels`, `annotations`, and other metadata are now nested in a map called `metadata`
|
||||
* `desiredState` is now called `spec`, and `currentState` is now called `status`
|
||||
* `/minions` has been moved to `/nodes`, and the resource has kind `Node`
|
||||
* The namespace is required (for all namespaced resources) and has moved from a URL parameter to the path: `/api/v1beta3/namespaces/{namespace}/{resource_collection}/{resource_name}`. If you were not using a namespace before, use `default` here.
|
||||
* The names of all resource collections are now lowercased - instead of `replicationControllers`, use `replicationcontrollers`.
* To watch for changes to a resource, open an HTTP or WebSocket connection to the collection query and provide the `?watch=true` query parameter along with the desired `resourceVersion` parameter to watch from (see the example after this list).
|
||||
* The `labels` query parameter has been renamed to `labelSelector`.
|
||||
* The `fields` query parameter has been renamed to `fieldSelector`.
|
||||
* The container `entrypoint` has been renamed to `command`, and `command` has been renamed to `args`.
|
||||
* Container, volume, and node resources are expressed as nested maps (e.g., `resources{cpu:1}`) rather than as individual fields, and resource values support [scaling suffixes](user-guide/compute-resources.html#specifying-resource-quantities) rather than fixed scales (e.g., milli-cores).
|
||||
* Restart policy is represented simply as a string (e.g., `"Always"`) rather than as a nested map (`always{}`).
|
||||
* Pull policies changed from `PullAlways`, `PullNever`, and `PullIfNotPresent` to `Always`, `Never`, and `IfNotPresent`.
|
||||
* The volume `source` is inlined into `volume` rather than nested.
|
||||
* Host volumes have been changed from `hostDir` to `hostPath` to better reflect that they can be files or directories.
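
Putting the path, watch, and selector changes above together, a watch request might look like the following sketch (host, port, namespace, and label value are placeholders; the `labelSelector` value is the URL-encoded form of `name=nginx`):

{% highlight console %}
{% raw %}
$ curl "http://localhost:8080/api/v1beta3/namespaces/default/pods?watch=true&resourceVersion=0&labelSelector=name%3Dnginx"
{% endraw %}
{% endhighlight %}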
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/api.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,39 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "Kubernetes Design Overview"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
# Kubernetes Design Overview
|
||||
|
||||
Kubernetes is a system for managing containerized applications across multiple hosts, providing basic mechanisms for deployment, maintenance, and scaling of applications.
|
||||
|
||||
Kubernetes establishes robust declarative primitives for maintaining the desired state requested by the user. We see these primitives as the main value added by Kubernetes. Self-healing mechanisms, such as auto-restarting, re-scheduling, and replicating containers, require active controllers, not just imperative orchestration.
|
||||
|
||||
Kubernetes is primarily targeted at applications composed of multiple containers, such as elastic, distributed micro-services. It is also designed to facilitate migration of non-containerized application stacks to Kubernetes. It therefore includes abstractions for grouping containers in both loosely coupled and tightly coupled formations, and provides ways for containers to find and communicate with each other in relatively familiar ways.
|
||||
|
||||
Kubernetes enables users to ask a cluster to run a set of containers. The system automatically chooses hosts to run those containers on. While Kubernetes's scheduler is currently very simple, we expect it to grow in sophistication over time. Scheduling is a policy-rich, topology-aware, workload-specific function that significantly impacts availability, performance, and capacity. The scheduler needs to take into account individual and collective resource requirements, quality of service requirements, hardware/software/policy constraints, affinity and anti-affinity specifications, data locality, inter-workload interference, deadlines, and so on. Workload-specific requirements will be exposed through the API as necessary.
|
||||
|
||||
Kubernetes is intended to run on a number of cloud providers, as well as on physical hosts.
|
||||
|
||||
A single Kubernetes cluster is not intended to span multiple availability zones. Instead, we recommend building a higher-level layer to replicate complete deployments of highly available applications across multiple zones (see [the multi-cluster doc](../admin/multi-cluster.html) and [cluster federation proposal](../proposals/federation.html) for more details).
|
||||
|
||||
Finally, Kubernetes aspires to be an extensible, pluggable, building-block OSS platform and toolkit. Therefore, architecturally, we want Kubernetes to be built as a collection of pluggable components and layers, with the ability to use alternative schedulers, controllers, storage systems, and distribution mechanisms, and we're evolving its current code in that direction. Furthermore, we want others to be able to extend Kubernetes functionality, such as with higher-level PaaS functionality or multi-cluster layers, without modification of core Kubernetes source. Therefore, its API isn't just (or even necessarily mainly) targeted at end users, but at tool and extension developers. Its APIs are intended to serve as the foundation for an open ecosystem of tools, automation systems, and higher-level API layers. Consequently, there are no "internal" inter-component APIs. All APIs are visible and available, including the APIs used by the scheduler, the node controller, the replication-controller manager, Kubelet's API, etc. There's no glass to break -- in order to handle more complex use cases, one can just access the lower-level APIs in a fully transparent, composable manner.
|
||||
|
||||
For more about the Kubernetes architecture, see [architecture](architecture.html).
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/README.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,278 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "K8s Identity and Access Management Sketch"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
# K8s Identity and Access Management Sketch
|
||||
|
||||
This document suggests a direction for identity and access management in the Kubernetes system.
|
||||
|
||||
|
||||
## Background
|
||||
|
||||
High level goals are:
|
||||
- Have a plan for how identity, authentication, and authorization will fit in to the API.
|
||||
- Have a plan for partitioning resources within a cluster between independent organizational units.
|
||||
- Ease integration with existing enterprise and hosted scenarios.
|
||||
|
||||
### Actors
|
||||
|
||||
Each of these can act as normal users or attackers.
|
||||
- External Users: People who are accessing applications running on K8s (e.g. a web site served by webserver running in a container on K8s), but who do not have K8s API access.
|
||||
- K8s Users : People who access the K8s API (e.g. create K8s API objects like Pods)
|
||||
- K8s Project Admins: People who manage access for some K8s Users
|
||||
- K8s Cluster Admins: People who control the machines, networks, or binaries that make up a K8s cluster.
|
||||
- K8s Admin means K8s Cluster Admins and K8s Project Admins taken together.
|
||||
|
||||
### Threats
|
||||
|
||||
Both intentional attacks and accidental use of privilege are concerns.
|
||||
|
||||
For both cases it may be useful to think about these categories differently:
|
||||
- Application Path - attack by sending network messages from the internet to the IP/port of any application running on K8s. May exploit weakness in application or misconfiguration of K8s.
|
||||
- K8s API Path - attack by sending network messages to any K8s API endpoint.
|
||||
- Insider Path - attack on K8s system components. Attacker may have privileged access to networks, machines or K8s software and data. Software errors in K8s system components and administrator error are some types of threat in this category.
|
||||
|
||||
This document is primarily concerned with K8s API paths, and secondarily with Insider paths. The Application path also needs to be secure, but is not the focus of this document.
|
||||
|
||||
### Assets to protect
|
||||
|
||||
External User assets:
|
||||
- Personal information like private messages, or images uploaded by External Users.
|
||||
- web server logs.
|
||||
|
||||
K8s User assets:
|
||||
- External User assets of each K8s User.
|
||||
- things private to the K8s app, like:
|
||||
- credentials for accessing other services (docker private repos, storage services, facebook, etc)
|
||||
- SSL certificates for web servers
|
||||
- proprietary data and code
|
||||
|
||||
K8s Cluster assets:
|
||||
- Assets of each K8s User.
|
||||
- Machine Certificates or secrets.
|
||||
- The value of K8s cluster computing resources (cpu, memory, etc).
|
||||
|
||||
This document is primarily about protecting K8s User assets and K8s cluster assets from other K8s Users and K8s Project and Cluster Admins.
|
||||
|
||||
### Usage environments
|
||||
|
||||
Cluster in Small organization:
|
||||
- K8s Admins may be the same people as K8s Users.
|
||||
- few K8s Admins.
|
||||
- prefer ease of use to fine-grained access control/precise accounting, etc.
|
||||
- Product requirement that it be easy for potential K8s Cluster Admin to try out setting up a simple cluster.
|
||||
|
||||
Cluster in Large organization:
|
||||
- K8s Admins typically distinct people from K8s Users. May need to divide K8s Cluster Admin access by roles.
|
||||
- K8s Users need to be protected from each other.
|
||||
- Auditing of K8s User and K8s Admin actions important.
|
||||
- flexible accurate usage accounting and resource controls important.
|
||||
- Lots of automated access to APIs.
|
||||
- Need to integrate with existing enterprise directory, authentication, accounting, auditing, and security policy infrastructure.
|
||||
|
||||
Org-run cluster:
|
||||
- organization that runs K8s master components is the same as the org that runs apps on K8s.
|
||||
- Nodes may be on-premises VMs or physical machines; Cloud VMs; or a mix.
|
||||
|
||||
Hosted cluster:
|
||||
- Offering K8s API as a service, or offering a PaaS or SaaS built on K8s.
- May already offer web services, and need to integrate with existing customer account concept, and existing authentication, accounting, auditing, and security policy infrastructure.
- May want to leverage K8s User accounts and accounting to manage their User accounts (not a priority to support this use case).
|
||||
- Precise and accurate accounting of resources needed. Resource controls needed for hard limits (Users given limited slice of data) and soft limits (Users can grow up to some limit and then be expanded).
|
||||
|
||||
K8s ecosystem services:
|
||||
- There may be companies that want to offer their existing services (Build, CI, A/B-test, release automation, etc) for use with K8s. There should be some story for this case.
|
||||
|
||||
Pod configs should be largely portable between Org-run and hosted configurations.
|
||||
|
||||
|
||||
# Design
|
||||
|
||||
Related discussion:
|
||||
- http://issue.k8s.io/442
|
||||
- http://issue.k8s.io/443
|
||||
|
||||
This doc describes two security profiles:
|
||||
- Simple profile: like single-user mode. Make it easy to evaluate K8s without lots of configuring accounts and policies. Protects from unauthorized users, but does not partition authorized users.
|
||||
- Enterprise profile: Provide mechanisms needed for large numbers of users. Defense in depth. Should integrate with existing enterprise security infrastructure.
|
||||
|
||||
The K8s distribution should include config templates and documentation for both the simple and enterprise profiles. The system should be flexible enough for knowledgeable users to create intermediate profiles, but K8s developers should only reason about those two profiles, not a matrix.

Features in this doc are divided into "Initial Features" and "Improvements". Initial features would be candidates for version 1.0.
|
||||
|
||||
## Identity
|
||||
|
||||
### userAccount
|
||||
|
||||
K8s will have a `userAccount` API object.
|
||||
- `userAccount` has a UID which is immutable. This is used to associate users with objects and to record actions in audit logs.
|
||||
- `userAccount` has a name which is a string and human readable and unique among userAccounts. It is used to refer to users in Policies, to ensure that the Policies are human readable. It can be changed only when there are no Policy objects or other objects which refer to that name. An email address is a suggested format for this field.
|
||||
- `userAccount` is not related to the unix username of processes in Pods created by that userAccount.
|
||||
- `userAccount` API objects can have labels.
|
||||
|
||||
The system may associate one or more Authentication Methods with a
|
||||
`userAccount` (but they are not formally part of the userAccount object.)
|
||||
In a simple deployment, the authentication method for a
|
||||
user might be an authentication token which is verified by a K8s server. In a
|
||||
more complex deployment, the authentication might be delegated to
|
||||
another system which is trusted by the K8s API to authenticate users, but where
|
||||
the authentication details are unknown to K8s.
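
Putting the properties above together, a `userAccount` object might be sketched roughly as follows (purely illustrative; the design above does not fix a serialization, and all field values here are made up):

```
{% raw %}
kind: userAccount
metadata:
  name: alice@example.com                      # human readable, unique; an email address is the suggested format
  uid: 1c0a3f26-9e14-4c9b-8f5e-d1a2b3c4d5e6    # immutable, used to associate objects and audit log entries with the user
  labels:
    group: frontend-team                       # e.g. membership in a group
    role: project-admin                        # e.g. ability to act in a role
{% endraw %}
```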
|
||||
|
||||
Initial Features:
|
||||
- there is no superuser `userAccount`
|
||||
- `userAccount` objects are statically populated in the K8s API store by reading a config file. Only a K8s Cluster Admin can do this.
|
||||
- `userAccount` can have a default `namespace`. If an API call does not specify a `namespace`, the default `namespace` for that caller is assumed.
|
||||
- `userAccount` is global. A single human with access to multiple namespaces is recommended to only have one userAccount.
|
||||
|
||||
Improvements:
|
||||
- Make `userAccount` part of a separate API group from core K8s objects like `pod`. Facilitates plugging in alternate Access Management.
|
||||
|
||||
Simple Profile:
|
||||
- single `userAccount`, used by all K8s Users and Project Admins. One access token shared by all.
|
||||
|
||||
Enterprise Profile:
|
||||
- every human user has own `userAccount`.
|
||||
- `userAccount`s have labels that indicate both membership in groups, and ability to act in certain roles.
|
||||
- each service using the API has its own `userAccount` too. (e.g. `scheduler`, `repcontroller`)
- automated jobs to denormalize the LDAP group info and the local system list of users into the K8s `userAccount` file.
|
||||
|
||||
### Unix accounts
|
||||
|
||||
A `userAccount` is not a Unix user account. The fact that a pod is started by a `userAccount` does not mean that the processes in that pod's containers run as a Unix user with a corresponding name or identity.
|
||||
|
||||
Initially:
|
||||
- The unix accounts available in a container, and used by the processes running in it, are those that are provided by the combination of the base operating system and the Docker manifest.
|
||||
- Kubernetes doesn't enforce any relation between `userAccount` and unix accounts.
|
||||
|
||||
Improvements:
|
||||
- Kubelet allocates disjoint blocks of root-namespace uids for each container. This may provide some defense-in-depth against container escapes. (https://github.com/docker/docker/pull/4572)
|
||||
- requires docker to integrate user namespace support, and deciding what getpwnam() does for these uids.
|
||||
- any features that help users avoid use of privileged containers (http://issue.k8s.io/391)
|
||||
|
||||
### Namespaces
|
||||
|
||||
K8s will have a `namespace` API object. It is similar to a Google Compute Engine `project`. It provides a namespace for objects created by a group of people co-operating together, preventing name collisions with non-cooperating groups. It also serves as a reference point for authorization policies.
|
||||
|
||||
Namespaces are described in [namespaces.md](namespaces.html).
|
||||
|
||||
In the Enterprise Profile:
|
||||
- a `userAccount` may have permission to access several `namespace`s.
|
||||
|
||||
In the Simple Profile:
|
||||
- There is a single `namespace` used by the single user.
|
||||
|
||||
Namespaces versus userAccount vs Labels:
|
||||
- `userAccount`s are intended for audit logging (both name and UID should be logged), and to define who has access to `namespace`s.
|
||||
- `labels` (see [docs/user-guide/labels.md](../../docs/user-guide/labels.html)) should be used to distinguish pods, users, and other objects that cooperate towards a common goal but are different in some way, such as version, or responsibilities.
|
||||
- `namespace`s prevent name collisions between uncoordinated groups of people, and provide a place to attach common policies for co-operating groups of people.
|
||||
|
||||
|
||||
## Authentication
|
||||
|
||||
Goals for K8s authentication:
|
||||
- Include a built-in authentication system with no configuration required to use in single-user mode, and little configuration required to add several user accounts, and no https proxy required.
|
||||
- Allow for authentication to be handled by a system external to Kubernetes, to allow integration with existing enterprise authorization systems. The Kubernetes namespace itself should avoid taking contributions of multiple authorization schemes. Instead, a trusted proxy in front of the apiserver can be used to authenticate users.
|
||||
- For organizations whose security requirements only allow FIPS compliant implementations (e.g. apache) for authentication.
|
||||
- So the proxy can terminate SSL, and isolate the CA-signed certificate from less trusted, higher-touch APIserver.
|
||||
- For organizations that already have existing SaaS web services (e.g. storage, VMs) and want a common authentication portal.
|
||||
- Avoid mixing authentication and authorization, so that authorization policies can be centrally managed, and to allow changes in authentication methods without affecting authorization code.
|
||||
|
||||
Initially:
|
||||
- Tokens used to authenticate a user.
|
||||
- Long lived tokens identify a particular `userAccount`.
|
||||
- Administrator utility generates tokens at cluster setup.
|
||||
- OAuth 2.0 Bearer Token protocol, http://tools.ietf.org/html/rfc6750 (see the request sketch after this list)
|
||||
- No scopes for tokens. Authorization happens in the API server
|
||||
- Tokens dynamically generated by apiserver to identify pods which are making API calls.
|
||||
- Tokens checked in a module of the APIserver.
|
||||
- Authentication in apiserver can be disabled by flag, to allow testing without authorization enabled, and to allow use of an authenticating proxy. In this mode, a query parameter or header added by the proxy will identify the caller.
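
As a sketch of the bearer-token scheme above, an authenticated API call might look like the following (the server address, path, and `$TOKEN` value are placeholders):

{% highlight console %}
{% raw %}
$ curl -H "Authorization: Bearer $TOKEN" https://k8s-apiserver.example.com/api/v1/namespaces/default/pods
{% endraw %}
{% endhighlight %}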
|
||||
|
||||
Improvements:
|
||||
- Refresh of tokens.
|
||||
- SSH keys to access inside containers.
|
||||
|
||||
To be considered for subsequent versions:
|
||||
- Fuller use of OAuth (http://tools.ietf.org/html/rfc6749)
|
||||
- Scoped tokens.
|
||||
- Tokens that are bound to the channel between the client and the api server
|
||||
- http://www.ietf.org/proceedings/90/slides/slides-90-uta-0.pdf
|
||||
- http://www.browserauth.net
|
||||
|
||||
|
||||
## Authorization
|
||||
|
||||
K8s authorization should:
|
||||
- Allow for a range of maturity levels, from single-user for those test-driving the system, to integration with existing enterprise authorization systems.
|
||||
- Allow for centralized management of users and policies. In some organizations, this will mean that the definition of users and access policies needs to reside on a system other than k8s and encompass other web services (such as a storage service).
|
||||
- Allow processes running in K8s Pods to take on identity, and to allow narrow scoping of permissions for those identities in order to limit damage from software faults.
|
||||
- Have Authorization Policies exposed as API objects so that a single config file can create or delete Pods, Replication Controllers, Services, and the identities and policies for those Pods and Replication Controllers.
|
||||
- Be separate as much as practical from Authentication, to allow Authentication methods to change over time and space, without impacting Authorization policies.
|
||||
|
||||
K8s will implement a relatively simple
|
||||
[Attribute-Based Access Control](http://en.wikipedia.org/wiki/Attribute_Based_Access_Control) model.
|
||||
The model will be described in more detail in a forthcoming document. The model will
|
||||
- Be less complex than XACML
|
||||
- Be easily recognizable to those familiar with Amazon IAM Policies.
|
||||
- Have a subset/aliases/defaults which allow it to be used in a way comfortable to those users more familiar with Role-Based Access Control.
|
||||
|
||||
Authorization policy is set by creating a set of Policy objects.
|
||||
|
||||
The API Server will be the Enforcement Point for Policy. For each API call that it receives, it will construct the Attributes needed to evaluate the policy (what user is making the call, what resource they are accessing, what they are trying to do to that resource, etc) and pass those attributes to a Decision Point. The Decision Point code evaluates the Attributes against all the Policies and allows or denies the API call. The system will be modular enough that the Decision Point code can either be linked into the APIserver binary, or be another service that the apiserver calls for each Decision (with appropriate time-limited caching as needed for performance).
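
A hypothetical sketch of that split, with the Decision Point behind a narrow interface (the names and fields here are illustrative only, not the final API):

{% highlight go %}
{% raw %}
package authorizer

// Attributes are the facts the API server (the Enforcement Point) gathers
// about a request before asking the Decision Point whether to allow it.
type Attributes struct {
	User      string // authenticated userAccount name
	Namespace string // namespace the request targets, if any
	Resource  string // e.g. "pods"
	Verb      string // e.g. "get", "create", "delete"
}

// Authorizer is the Decision Point: it evaluates the attributes against all
// Policy objects and allows or denies the call. It could be linked into the
// apiserver binary or implemented as a separate service behind this interface.
type Authorizer interface {
	Authorize(a Attributes) (allowed bool, reason string, err error)
}
{% endraw %}
{% endhighlight %}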
|
||||
|
||||
Some Policy objects may be applicable only to a single namespace; K8s Project Admins would be able to create those as needed. Other Policy objects may be applicable to all namespaces; a K8s Cluster Admin might create those in order to authorize a new type of controller to be used by all namespaces, or to make a K8s User into a K8s Project Admin.
|
||||
|
||||
|
||||
## Accounting
|
||||
|
||||
The API should have a `quota` concept (see http://issue.k8s.io/442). A quota object relates a namespace (and optionally a label selector) to a maximum quantity of resources that may be used (see [resources design doc](resources.html)).
|
||||
|
||||
Initially:
|
||||
- a `quota` object is immutable.
|
||||
- for hosted K8s systems that do billing, Project is the recommended level for billing accounts.
- Every object that consumes resources should have a `namespace` so that resource usage stats can be rolled up to the `namespace`.
|
||||
- K8s Cluster Admin sets quota objects by writing a config file.
|
||||
|
||||
Improvements:
|
||||
- allow one namespace to charge the quota for one or more other namespaces. This would be controlled by a policy which allows changing a billing_namespace= label on an object.
|
||||
- allow quota to be set by namespace owners for (namespace x label) combinations (e.g. let the "webserver" namespace use 100 cores, but to prevent accidents, don't allow the "webserver" namespace with "instance=test" to use more than 10 cores).
|
||||
- tools to help write consistent quota config files based on number of nodes, historical namespace usages, QoS needs, etc.
|
||||
- way for K8s Cluster Admin to incrementally adjust Quota objects.
|
||||
|
||||
Simple profile:
|
||||
- a single `namespace` with infinite resource limits.
|
||||
|
||||
Enterprise profile:
|
||||
- multiple namespaces each with their own limits.
|
||||
|
||||
Issues:
|
||||
- need for locking or "eventual consistency" when multiple apiserver goroutines are accessing the object store and handling pod creations.
|
||||
|
||||
|
||||
## Audit Logging
|
||||
|
||||
API actions can be logged.
|
||||
|
||||
Initial implementation:
|
||||
- All API calls logged to nginx logs.
|
||||
|
||||
Improvements:
|
||||
- API server does logging instead.
|
||||
- Policies to drop logging for high rate trusted API calls, or by users performing audit or other sensitive functions.
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/access.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,106 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "Kubernetes Proposal - Admission Control"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
# Kubernetes Proposal - Admission Control
|
||||
|
||||
**Related PR:**
|
||||
|
||||
| Topic | Link |
|
||||
| ----- | ---- |
|
||||
| Separate validation from RESTStorage | http://issue.k8s.io/2977 |
|
||||
|
||||
## Background
|
||||
|
||||
High level goals:
|
||||
|
||||
* Enable an easy-to-use mechanism to provide admission control to the cluster
* Enable a provider to support multiple admission control strategies or author their own
* Ensure any rejected request can propagate errors back to the caller explaining why the request failed
|
||||
|
||||
Authorization via policy is focused on answering whether a user is authorized to perform an action.

Admission Control is focused on whether the system will accept an authorized action.
|
||||
|
||||
Kubernetes may choose to dismiss an authorized action based on any number of admission control strategies.
|
||||
|
||||
This proposal documents the basic design, and describes how any number of admission control plug-ins could be injected.
|
||||
|
||||
Implementations of specific admission control strategies are handled in separate documents.
|
||||
|
||||
## kube-apiserver
|
||||
|
||||
The kube-apiserver takes the following OPTIONAL arguments to enable admission control
|
||||
|
||||
| Option | Behavior |
|
||||
| ------ | -------- |
|
||||
| admission-control | Comma-delimited, ordered list of admission control choices to invoke prior to modifying or deleting an object. |
|
||||
| admission-control-config-file | File with admission control configuration parameters to boot-strap plug-in. |
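
For example, an apiserver started with two plug-ins enabled, in order, might look like the following sketch (the plug-in list and config file path are illustrative; other required apiserver flags are omitted):

{% highlight console %}
{% raw %}
$ kube-apiserver --admission-control=LimitRanger,ResourceQuota \
    --admission-control-config-file=/etc/kubernetes/admission-control.conf
{% endraw %}
{% endhighlight %}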
|
||||
|
||||
An **AdmissionControl** plug-in is an implementation of the following interface:
|
||||
|
||||
{% highlight go %}
|
||||
{% raw %}
|
||||
package admission
|
||||
|
||||
// Attributes is an interface used by a plug-in to make an admission decision on an individual request.
|
||||
type Attributes interface {
|
||||
GetNamespace() string
|
||||
GetKind() string
|
||||
GetOperation() string
|
||||
GetObject() runtime.Object
|
||||
}
|
||||
|
||||
// Interface is an abstract, pluggable interface for Admission Control decisions.
|
||||
type Interface interface {
|
||||
// Admit makes an admission decision based on the request attributes
|
||||
// An error is returned if it denies the request.
|
||||
Admit(a Attributes) (err error)
|
||||
}
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
A **plug-in** must be compiled with the binary, and is registered as an available option by providing a name and an implementation of admission.Interface.
|
||||
|
||||
{% highlight go %}
|
||||
{% raw %}
|
||||
func init() {
|
||||
admission.RegisterPlugin("AlwaysDeny", func(client client.Interface, config io.Reader) (admission.Interface, error) { return NewAlwaysDeny(), nil })
|
||||
}
|
||||
{% endraw %}
|
||||
{% endhighlight %}
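
For illustration, a minimal implementation behind that registration might look like the following sketch (the package layout and import path are assumptions; only the interface shown above is relied upon):

{% highlight go %}
{% raw %}
package alwaysdeny

import (
	"errors"

	"k8s.io/kubernetes/pkg/admission" // import path is illustrative
)

// alwaysDeny rejects every request it is asked to admit.
type alwaysDeny struct{}

// Admit implements admission.Interface by returning an error for all requests,
// which causes the APIServer to reject them.
func (alwaysDeny) Admit(a admission.Attributes) error {
	return errors.New("admission denied: AlwaysDeny rejects all requests")
}

// NewAlwaysDeny returns the plug-in instance registered by the init function above.
func NewAlwaysDeny() admission.Interface {
	return alwaysDeny{}
}
{% endraw %}
{% endhighlight %}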
|
||||
|
||||
Invocation of admission control is handled by the **APIServer** and not individual **RESTStorage** implementations.
|
||||
|
||||
This design assumes that **Issue 297** is adopted, and as a consequence, the general framework of the APIServer request/response flow will ensure the following:
|
||||
|
||||
1. Incoming request
|
||||
2. Authenticate user
|
||||
3. Authorize user
|
||||
4. If operation=create|update|delete|connect, then admission.Admit(requestAttributes)
|
||||
- invoke each admission.Interface object in sequence
|
||||
5. Case on the operation:
|
||||
- If operation=create|update, then validate(object) and persist
|
||||
- If operation=delete, delete the object
|
||||
- If operation=connect, exec
|
||||
|
||||
If there is an error at any step, the request is canceled.
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/admission_control.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,219 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "Admission control plugin: LimitRanger"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
# Admission control plugin: LimitRanger
|
||||
|
||||
## Background
|
||||
|
||||
This document proposes a system for enforcing resource requirements constraints as part of admission control.
|
||||
|
||||
## Use cases
|
||||
|
||||
1. Ability to enumerate resource requirement constraints per namespace
|
||||
2. Ability to enumerate min/max resource constraints for a pod
|
||||
3. Ability to enumerate min/max resource constraints for a container
|
||||
4. Ability to specify default resource limits for a container
|
||||
5. Ability to specify default resource requests for a container
|
||||
6. Ability to enforce a ratio between request and limit for a resource.
|
||||
|
||||
## Data Model
|
||||
|
||||
The **LimitRange** resource is scoped to a **Namespace**.
|
||||
|
||||
### Type
|
||||
|
||||
{% highlight go %}
|
||||
{% raw %}
|
||||
// LimitType is a type of object that is limited
|
||||
type LimitType string
|
||||
|
||||
const (
|
||||
// Limit that applies to all pods in a namespace
|
||||
LimitTypePod LimitType = "Pod"
|
||||
// Limit that applies to all containers in a namespace
|
||||
LimitTypeContainer LimitType = "Container"
|
||||
)
|
||||
|
||||
// LimitRangeItem defines a min/max usage limit for any resource that matches on kind.
|
||||
type LimitRangeItem struct {
|
||||
// Type of resource that this limit applies to.
|
||||
Type LimitType `json:"type,omitempty"`
|
||||
// Max usage constraints on this kind by resource name.
|
||||
Max ResourceList `json:"max,omitempty"`
|
||||
// Min usage constraints on this kind by resource name.
|
||||
Min ResourceList `json:"min,omitempty"`
|
||||
// Default resource requirement limit value by resource name if resource limit is omitted.
|
||||
Default ResourceList `json:"default,omitempty"`
|
||||
// DefaultRequest is the default resource requirement request value by resource name if resource request is omitted.
|
||||
DefaultRequest ResourceList `json:"defaultRequest,omitempty"`
|
||||
// MaxLimitRequestRatio if specified, the named resource must have a request and limit that are both non-zero where limit divided by request is less than or equal to the enumerated value; this represents the max burst for the named resource.
|
||||
MaxLimitRequestRatio ResourceList `json:"maxLimitRequestRatio,omitempty"`
|
||||
}
|
||||
|
||||
// LimitRangeSpec defines a min/max usage limit for resources that match on kind.
|
||||
type LimitRangeSpec struct {
|
||||
// Limits is the list of LimitRangeItem objects that are enforced.
|
||||
Limits []LimitRangeItem `json:"limits"`
|
||||
}
|
||||
|
||||
// LimitRange sets resource usage limits for each kind of resource in a Namespace.
|
||||
type LimitRange struct {
|
||||
TypeMeta `json:",inline"`
|
||||
// Standard object's metadata.
|
||||
// More info: http://releases.k8s.io/release-1.1/docs/devel/api-conventions.md#metadata
|
||||
ObjectMeta `json:"metadata,omitempty"`
|
||||
|
||||
// Spec defines the limits enforced.
|
||||
// More info: http://releases.k8s.io/release-1.1/docs/devel/api-conventions.md#spec-and-status
|
||||
Spec LimitRangeSpec `json:"spec,omitempty"`
|
||||
}
|
||||
|
||||
// LimitRangeList is a list of LimitRange items.
|
||||
type LimitRangeList struct {
|
||||
TypeMeta `json:",inline"`
|
||||
// Standard list metadata.
|
||||
// More info: http://releases.k8s.io/release-1.1/docs/devel/api-conventions.md#types-kinds
|
||||
ListMeta `json:"metadata,omitempty"`
|
||||
|
||||
// Items is a list of LimitRange objects.
|
||||
// More info: http://releases.k8s.io/release-1.1/docs/design/admission_control_limit_range.md
|
||||
Items []LimitRange `json:"items"`
|
||||
}
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
### Validation
|
||||
|
||||
Validation of a **LimitRange** enforces that for a given named resource the following rules apply:
|
||||
|
||||
Min (if specified) <= DefaultRequest (if specified) <= Default (if specified) <= Max (if specified)
|
||||
|
||||
### Default Value Behavior
|
||||
|
||||
The following default value behaviors are applied to a LimitRange for a given named resource.
|
||||
|
||||
```
|
||||
{% raw %}
|
||||
if LimitRangeItem.Default[resourceName] is undefined
|
||||
if LimitRangeItem.Max[resourceName] is defined
|
||||
LimitRangeItem.Default[resourceName] = LimitRangeItem.Max[resourceName]
|
||||
{% endraw %}
|
||||
```
|
||||
|
||||
```
|
||||
{% raw %}
|
||||
if LimitRangeItem.DefaultRequest[resourceName] is undefined
|
||||
if LimitRangeItem.Default[resourceName] is defined
|
||||
LimitRangeItem.DefaultRequest[resourceName] = LimitRangeItem.Default[resourceName]
|
||||
else if LimitRangeItem.Min[resourceName] is defined
|
||||
LimitRangeItem.DefaultRequest[resourceName] = LimitRangeItem.Min[resourceName]
|
||||
{% endraw %}
|
||||
```
|
||||
|
||||
## AdmissionControl plugin: LimitRanger
|
||||
|
||||
The **LimitRanger** plug-in introspects all incoming pod requests and evaluates the constraints defined on a LimitRange.
|
||||
|
||||
If a constraint is not specified for an enumerated resource, it is not enforced or tracked.
|
||||
|
||||
To enable the plug-in and support for LimitRange, the kube-apiserver must be configured as follows:
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kube-apiserver --admission-control=LimitRanger
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
### Enforcement of constraints
|
||||
|
||||
**Type: Container**
|
||||
|
||||
Supported Resources:
|
||||
|
||||
1. memory
|
||||
2. cpu
|
||||
|
||||
Supported Constraints:
|
||||
|
||||
Per container, the following must hold true
|
||||
|
||||
| Constraint | Behavior |
|
||||
| ---------- | -------- |
|
||||
| Min | Min <= Request (required) <= Limit (optional) |
|
||||
| Max | Limit (required) <= Max |
|
||||
| LimitRequestRatio | LimitRequestRatio <= ( Limit (required, non-zero) / Request (required, non-zero)) |
|
||||
|
||||
Supported Defaults:
|
||||
|
||||
1. Default - if the named resource has no enumerated value, the Limit is equal to the Default
|
||||
2. DefaultRequest - if the named resource has no enumerated value, the Request is equal to the DefaultRequest
|
||||
|
||||
**Type: Pod**
|
||||
|
||||
Supported Resources:
|
||||
|
||||
1. memory
|
||||
2. cpu
|
||||
|
||||
Supported Constraints:
|
||||
|
||||
Across all containers in a pod, the following must hold true
|
||||
|
||||
| Constraint | Behavior |
|
||||
| ---------- | -------- |
|
||||
| Min | Min <= Request (required) <= Limit (optional) |
|
||||
| Max | Limit (required) <= Max |
|
||||
| LimitRequestRatio | LimitRequestRatio <= ( Limit (required, non-zero) / Request (non-zero) ) |
|
||||
|
||||
## Run-time configuration
|
||||
|
||||
The default `LimitRange` that is applied via Salt configuration will be updated as follows:
|
||||
|
||||
```
|
||||
{% raw %}
|
||||
apiVersion: "v1"
|
||||
kind: "LimitRange"
|
||||
metadata:
|
||||
name: "limits"
|
||||
namespace: default
|
||||
spec:
|
||||
limits:
|
||||
- type: "Container"
|
||||
defaultRequests:
|
||||
cpu: "100m"
|
||||
{% endraw %}
|
||||
```
|
||||
|
||||
## Example
|
||||
|
||||
An example LimitRange configuration:
|
||||
|
||||
| Type | Resource | Min | Max | Default | DefaultRequest | LimitRequestRatio |
|
||||
| ---- | -------- | --- | --- | ------- | -------------- | ----------------- |
|
||||
| Container | cpu | .1 | 1 | 500m | 250m | 4 |
|
||||
| Container | memory | 250Mi | 1Gi | 500Mi | 250Mi | |
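
Expressed as a manifest, that configuration might look like the following sketch (the object name is illustrative; values mirror the table, with `.1` cpu written as `100m`):

```
{% raw %}
apiVersion: "v1"
kind: "LimitRange"
metadata:
  name: "example-limits"
spec:
  limits:
    - type: "Container"
      min:
        cpu: "100m"
        memory: "250Mi"
      max:
        cpu: "1"
        memory: "1Gi"
      default:
        cpu: "500m"
        memory: "500Mi"
      defaultRequest:
        cpu: "250m"
        memory: "250Mi"
      maxLimitRequestRatio:
        cpu: "4"
{% endraw %}
```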
|
||||
|
||||
Assuming an incoming container that specifies no resource requirements, the following would happen.

1. The incoming container cpu would request 250m with a limit of 500m.
2. The incoming container memory would request 250Mi with a limit of 500Mi.
3. If the container is later resized, its cpu would be constrained to between .1 and 1 and the ratio of limit to request could not exceed 4.
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/admission_control_limit_range.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,219 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "Admission control plugin: ResourceQuota"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
# Admission control plugin: ResourceQuota
|
||||
|
||||
## Background
|
||||
|
||||
This document describes a system for enforcing hard resource usage limits per namespace as part of admission control.
|
||||
|
||||
## Use cases
|
||||
|
||||
1. Ability to enumerate resource usage limits per namespace.
|
||||
2. Ability to monitor resource usage for tracked resources.
|
||||
3. Ability to reject resource usage exceeding hard quotas.
|
||||
|
||||
## Data Model
|
||||
|
||||
The **ResourceQuota** object is scoped to a **Namespace**.
|
||||
|
||||
{% highlight go %}
|
||||
{% raw %}
|
||||
// The following identify resource constants for Kubernetes object types
|
||||
const (
|
||||
// Pods, number
|
||||
ResourcePods ResourceName = "pods"
|
||||
// Services, number
|
||||
ResourceServices ResourceName = "services"
|
||||
// ReplicationControllers, number
|
||||
ResourceReplicationControllers ResourceName = "replicationcontrollers"
|
||||
// ResourceQuotas, number
|
||||
ResourceQuotas ResourceName = "resourcequotas"
|
||||
// ResourceSecrets, number
|
||||
ResourceSecrets ResourceName = "secrets"
|
||||
// ResourcePersistentVolumeClaims, number
|
||||
ResourcePersistentVolumeClaims ResourceName = "persistentvolumeclaims"
|
||||
)
|
||||
|
||||
// ResourceQuotaSpec defines the desired hard limits to enforce for Quota
|
||||
type ResourceQuotaSpec struct {
|
||||
// Hard is the set of desired hard limits for each named resource
|
||||
Hard ResourceList `json:"hard,omitempty" description:"hard is the set of desired hard limits for each named resource; see http://releases.k8s.io/release-1.1/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"`
|
||||
}
|
||||
|
||||
// ResourceQuotaStatus defines the enforced hard limits and observed use
|
||||
type ResourceQuotaStatus struct {
|
||||
// Hard is the set of enforced hard limits for each named resource
|
||||
Hard ResourceList `json:"hard,omitempty" description:"hard is the set of enforced hard limits for each named resource; see http://releases.k8s.io/release-1.1/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"`
|
||||
// Used is the current observed total usage of the resource in the namespace
|
||||
Used ResourceList `json:"used,omitempty" description:"used is the current observed total usage of the resource in the namespace"`
|
||||
}
|
||||
|
||||
// ResourceQuota sets aggregate quota restrictions enforced per namespace
|
||||
type ResourceQuota struct {
|
||||
TypeMeta `json:",inline"`
|
||||
ObjectMeta `json:"metadata,omitempty" description:"standard object metadata; see http://releases.k8s.io/release-1.1/docs/devel/api-conventions.md#metadata"`
|
||||
|
||||
// Spec defines the desired quota
|
||||
Spec ResourceQuotaSpec `json:"spec,omitempty" description:"spec defines the desired quota; http://releases.k8s.io/release-1.1/docs/devel/api-conventions.md#spec-and-status"`
|
||||
|
||||
// Status defines the actual enforced quota and its current usage
|
||||
Status ResourceQuotaStatus `json:"status,omitempty" description:"status defines the actual enforced quota and current usage; http://releases.k8s.io/release-1.1/docs/devel/api-conventions.md#spec-and-status"`
|
||||
}
|
||||
|
||||
// ResourceQuotaList is a list of ResourceQuota items
|
||||
type ResourceQuotaList struct {
|
||||
TypeMeta `json:",inline"`
|
||||
ListMeta `json:"metadata,omitempty" description:"standard list metadata; see http://releases.k8s.io/release-1.1/docs/devel/api-conventions.md#metadata"`
|
||||
|
||||
// Items is a list of ResourceQuota objects
|
||||
Items []ResourceQuota `json:"items" description:"items is a list of ResourceQuota objects; see http://releases.k8s.io/release-1.1/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"`
|
||||
}
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
## Quota Tracked Resources
|
||||
|
||||
The following resources are supported by the quota system.
|
||||
|
||||
| Resource | Description |
|
||||
| ------------ | ----------- |
|
||||
| cpu | Total requested cpu usage |
|
||||
| memory | Total requested memory usage |
|
||||
| pods | Total number of active pods where phase is pending or active. |
|
||||
| services | Total number of services |
|
||||
| replicationcontrollers | Total number of replication controllers |
|
||||
| resourcequotas | Total number of resource quotas |
|
||||
| secrets | Total number of secrets |
|
||||
| persistentvolumeclaims | Total number of persistent volume claims |
|
||||
|
||||
If a third party wants to track additional resources, it must follow the resource naming conventions prescribed by Kubernetes. This means the resource must have a fully-qualified name (i.e. `mycompany.org/shinynewresource`).
|
||||
|
||||
## Resource Requirements: Requests vs Limits
|
||||
|
||||
If a resource supports the ability to distinguish between a request and a limit for a resource,
|
||||
the quota tracking system will only cost the request value against the quota usage. If a resource
|
||||
is tracked by quota, and no request value is provided, the associated entity is rejected as part of admission.
|
||||
|
||||
For an example, consider the following scenarios relative to tracking quota on CPU:
|
||||
|
||||
| Pod | Container | Request CPU | Limit CPU | Result |
|
||||
| --- | --------- | ----------- | --------- | ------ |
|
||||
| X | C1 | 100m | 500m | The quota usage is incremented 100m |
|
||||
| Y | C2 | 100m | none | The quota usage is incremented 100m |
|
||||
| Y | C2 | none | 500m | The quota usage is incremented 500m since request will default to limit |
|
||||
| Z | C3 | none | none | The pod is rejected since it does not enumerate a request. |
|
||||
|
||||
The rationale for accounting for the requested amount of a resource versus the limit is the belief
|
||||
that a user should only be charged for what they are scheduled against in the cluster. In addition,
|
||||
attempting to track usage against actual usage, where request < actual < limit, is considered highly
|
||||
volatile.
|
||||
|
||||
As a consequence of this decision, the user is able to spread its usage of a resource across multiple tiers
|
||||
of service. Let's demonstrate this via an example with a 4 cpu quota.
|
||||
|
||||
The quota may be allocated as follows:
|
||||
|
||||
| Pod | Container | Request CPU | Limit CPU | Tier | Quota Usage |
|
||||
| --- | --------- | ----------- | --------- | ---- | ----------- |
|
||||
| X | C1 | 1 | 4 | Burstable | 1 |
|
||||
| Y | C2 | 2 | 2 | Guaranteed | 2 |
|
||||
| Z | C3 | 1 | 3 | Burstable | 1 |
|
||||
|
||||
It is possible that the pods may consume 9 cpu over a given time period, depending on the available cpu of the nodes that held pods X and Z, but since we scheduled X and Z relative to the request, we only track the requested value against their allocated quota. If one wants to restrict the ratio between the request and limit, it is encouraged that the user define a **LimitRange** with **LimitRequestRatio** to control burst-out behavior. This would, in effect, let an administrator keep the difference between request and limit more in line with tracked usage if desired.
|
||||
|
||||
## Status API
|
||||
|
||||
A REST API endpoint to update the status section of the **ResourceQuota** is exposed. It requires an atomic compare-and-swap
|
||||
in order to keep resource usage tracking consistent.
|
||||
|
||||
## Resource Quota Controller
|
||||
|
||||
A resource quota controller monitors observed usage for tracked resources in the **Namespace**.
|
||||
|
||||
If there is observed difference between the current usage stats versus the current **ResourceQuota.Status**, the controller
|
||||
posts an update of the currently observed usage metrics to the **ResourceQuota** via the /status endpoint.
|
||||
|
||||
The resource quota controller is the only component capable of monitoring and recording usage updates after a DELETE operation
|
||||
since admission control is incapable of guaranteeing a DELETE request actually succeeded.
|
||||
|
||||
## AdmissionControl plugin: ResourceQuota
|
||||
|
||||
The **ResourceQuota** plug-in introspects all incoming admission requests.
|
||||
|
||||
To enable the plug-in and support for ResourceQuota, the kube-apiserver must be configured as follows:
|
||||
|
||||
```
|
||||
{% raw %}
|
||||
$ kube-apiserver --admission-control=ResourceQuota
|
||||
{% endraw %}
|
||||
```
|
||||
|
||||
It makes decisions by evaluating the incoming object against all defined **ResourceQuota.Status.Hard** resource limits in the request
|
||||
namespace. If acceptance of the resource would cause the total usage of a named resource to exceed its hard limit, the request is denied.
|
||||
|
||||
If the incoming request does not cause the total usage to exceed any of the enumerated hard resource limits, the plug-in will post a
|
||||
**ResourceQuota.Status** document to the server to atomically update the observed usage based on the previously read
|
||||
**ResourceQuota.ResourceVersion**. This keeps incremental usage atomically consistent, but does introduce a bottleneck (intentionally)
|
||||
into the system.
|
||||
|
||||
To optimize system performance, it is encouraged that all resource quotas in a **Namespace** are tracked on the same **ResourceQuota** document. As a result, it is encouraged to cap the total number of individual **ResourceQuota** documents tracked in a **Namespace** at 1.
|
||||
|
||||
## kubectl
|
||||
|
||||
kubectl is modified to support the **ResourceQuota** resource.
|
||||
|
||||
`kubectl describe` provides a human-readable output of quota.
|
||||
|
||||
For example,
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl create -f docs/admin/resourcequota/namespace.yaml
|
||||
namespace "quota-example" created
|
||||
$ kubectl create -f docs/admin/resourcequota/quota.yaml --namespace=quota-example
|
||||
resourcequota "quota" created
|
||||
$ kubectl describe quota quota --namespace=quota-example
|
||||
Name: quota
|
||||
Namespace: quota-example
|
||||
Resource Used Hard
|
||||
-------- ---- ----
|
||||
cpu 0 20
|
||||
memory 0 1Gi
|
||||
persistentvolumeclaims 0 10
|
||||
pods 0 10
|
||||
replicationcontrollers 0 20
|
||||
resourcequotas 1 1
|
||||
secrets 1 10
|
||||
services 0 5
|
||||
{% endraw %}
|
||||
{% endhighlight %}
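
For reference, a `quota.yaml` producing the hard limits shown above might look roughly like the following (a sketch matching the `describe` output; the actual example file in the repository may differ):

```
{% raw %}
apiVersion: v1
kind: ResourceQuota
metadata:
  name: quota
spec:
  hard:
    cpu: "20"
    memory: 1Gi
    persistentvolumeclaims: "10"
    pods: "10"
    replicationcontrollers: "20"
    resourcequotas: "1"
    secrets: "10"
    services: "5"
{% endraw %}
```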
|
||||
|
||||
## More information
|
||||
|
||||
See [resource quota document](../admin/resource-quota.html) and the [example of Resource Quota](../admin/resourcequota/) for more information.
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/admission_control_resource_quota.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,67 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "Kubernetes architecture"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
# Kubernetes architecture
|
||||
|
||||
A running Kubernetes cluster contains node agents (`kubelet`) and master components (APIs, scheduler, etc), on top of a distributed storage solution. This diagram shows our desired eventual state, though we're still working on a few things, like making `kubelet` itself (all our components, really) run within containers, and making the scheduler 100% pluggable.
|
||||
|
||||
![Architecture Diagram](architecture.png?raw=true "Architecture overview")
|
||||
|
||||
## The Kubernetes Node
|
||||
|
||||
When looking at the architecture of the system, we'll break it down into services that run on the worker node and services that compose the cluster-level control plane.
|
||||
|
||||
The Kubernetes node has the services necessary to run application containers and be managed from the master systems.
|
||||
|
||||
Each node runs Docker, of course. Docker takes care of the details of downloading images and running containers.
|
||||
|
||||
### `kubelet`
|
||||
|
||||
The `kubelet` manages [pods](../user-guide/pods.html) and their containers, their images, their volumes, etc.
|
||||
|
||||
### `kube-proxy`
|
||||
|
||||
Each node also runs a simple network proxy and load balancer (see the [services FAQ](https://github.com/kubernetes/kubernetes/wiki/Services-FAQ) for more details). This reflects `services` (see [the services doc](../user-guide/services.html) for more details) as defined in the Kubernetes API on each node and can do simple TCP and UDP stream forwarding (round robin) across a set of backends.
|
||||
|
||||
Service endpoints are currently found via [DNS](../admin/dns.html) or through environment variables (both [Docker-links-compatible](https://docs.docker.com/userguide/dockerlinks/) and Kubernetes `{FOO}_SERVICE_HOST` and `{FOO}_SERVICE_PORT` variables are supported). These variables resolve to ports managed by the service proxy.
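
For example, a container in a cluster with a hypothetical service named `redis-master` (the cluster IP and port here are made up) might see variables like these:

{% highlight console %}
{% raw %}
$ env | grep REDIS_MASTER_SERVICE
REDIS_MASTER_SERVICE_HOST=10.0.0.11
REDIS_MASTER_SERVICE_PORT=6379
{% endraw %}
{% endhighlight %}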
|
||||
|
||||
## The Kubernetes Control Plane
|
||||
|
||||
The Kubernetes control plane is split into a set of components. Currently they all run on a single _master_ node, but that is expected to change soon in order to support high-availability clusters. These components work together to provide a unified view of the cluster.
|
||||
|
||||
### `etcd`
|
||||
|
||||
All persistent master state is stored in an instance of `etcd`. This provides a great way to store configuration data reliably. With `watch` support, coordinating components can be notified very quickly of changes.
|
||||
|
||||
### Kubernetes API Server
|
||||
|
||||
The apiserver serves up the [Kubernetes API](../api.html). It is intended to be a CRUD-y server, with most/all business logic implemented in separate components or in plug-ins. It mainly processes REST operations, validates them, and updates the corresponding objects in `etcd` (and eventually other stores).
|
||||
|
||||
### Scheduler
|
||||
|
||||
The scheduler binds unscheduled pods to nodes via the `/binding` API. The scheduler is pluggable, and we expect to support multiple cluster schedulers and even user-provided schedulers in the future.
|
||||
|
||||
### Kubernetes Controller Manager Server
|
||||
|
||||
All other cluster-level functions are currently performed by the Controller Manager. For instance, `Endpoints` objects are created and updated by the endpoints controller, and nodes are discovered, managed, and monitored by the node controller. These could eventually be split into separate components to make them independently pluggable.
|
||||
|
||||
The [`replicationcontroller`](../user-guide/replication-controller.html) is a mechanism that is layered on top of the simple [`pod`](../user-guide/pods.html) API. We eventually plan to port it to a generic plug-in mechanism, once one is implemented.
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/architecture.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
After Width: | Height: | Size: 262 KiB |
After Width: | Height: | Size: 50 KiB |
|
@ -0,0 +1,83 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "Clustering in Kubernetes"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
# Clustering in Kubernetes
|
||||
|
||||
|
||||
## Overview
|
||||
|
||||
The term "clustering" refers to the process of having all members of the Kubernetes cluster find and trust each other. There are multiple different ways to achieve clustering with different security and usability profiles. This document attempts to lay out the user experiences for clustering that Kubernetes aims to address.
|
||||
|
||||
Once a cluster is established, the following is true:
|
||||
|
||||
1. **Master -> Node** The master needs to know which nodes can take work and what their current status is with respect to capacity.
|
||||
1. **Location** The master knows the name and location of all of the nodes in the cluster.
|
||||
* For the purposes of this doc, location and name should be enough information so that the master can open a TCP connection to the Node. Most probably we will make this either an IP address or a DNS name. It is going to be important to be consistent here (master must be able to reach kubelet on that DNS name) so that we can verify certificates appropriately.
|
||||
2. **Target AuthN** A way to securely talk to the kubelet on that node. Currently we call out to the kubelet over HTTP. This should be over HTTPS and the master should know what CA to trust for that node.
|
||||
3. **Caller AuthN/Z** This would be the master verifying itself (and permissions) when calling the node. Currently, this is only used to collect statistics as authorization isn't critical. This may change in the future though.
|
||||
2. **Node -> Master** The nodes currently talk to the master to know which pods have been assigned to them and to publish events.
|
||||
1. **Location** The nodes must know where the master is.
|
||||
2. **Target AuthN** Since the master is assigning work to the nodes, it is critical that they verify whom they are talking to.
|
||||
3. **Caller AuthN/Z** The nodes publish events and so must be authenticated to the master. Ideally this authentication is specific to each node so that authorization can be narrowly scoped. The details of the work to run (including things like environment variables) might be considered sensitive and should be locked down also.
|
||||
|
||||
**Note:** While the description here refers to a singular Master, in the future we should enable multiple Masters operating in an HA mode. While the "Master" is currently the combination of the API Server, Scheduler and Controller Manager, we will restrict ourselves to thinking about the main API and policy engine -- the API Server.
|
||||
|
||||
## Current Implementation
|
||||
|
||||
A central authority (generally the master) is responsible for determining the set of machines which are members of the cluster. Calls to create and remove worker nodes in the cluster are restricted to this single authority, and any other requests to add or remove worker nodes are rejected. (1.i).
|
||||
|
||||
Communication from the master to nodes is currently over HTTP and is not secured or authenticated in any way. (1.ii, 1.iii).
|
||||
|
||||
The location of the master is communicated out of band to the nodes. For GCE, this is done via Salt. Other cluster instructions/scripts use other methods. (2.i)
|
||||
|
||||
Most communication from the node to the master is currently over HTTP. When it is done over HTTPS, there is no verification of the master's certificate (2.ii).
|
||||
|
||||
Currently, the node/kubelet is authenticated to the master via a token shared across all nodes. This token is distributed out of band (using Salt for GCE) and is optional. If it is not present then the kubelet is unable to publish events to the master. (2.iii)
|
||||
|
||||
Our current mix of out of band communication doesn't meet all of our needs from a security point of view and is difficult to set up and configure.
|
||||
|
||||
## Proposed Solution
|
||||
|
||||
The proposed solution will provide a range of options for setting up and maintaining a secure Kubernetes cluster. We want to allow for both centrally controlled systems (leveraging pre-existing trust and configuration systems) and more ad-hoc, automagic systems that are incredibly easy to set up.
|
||||
|
||||
The building blocks of an easier solution:
|
||||
|
||||
* **Move to TLS** We will move to using TLS for all intra-cluster communication. We will explicitly identify the trust chain (the set of trusted CAs) as opposed to trusting the system CAs. We will also use client certificates for all AuthN.
|
||||
* [optional] **API driven CA** Optionally, we will run a CA in the master that will mint certificates for the nodes/kubelets. There will be pluggable policies that will automatically approve certificate requests here as appropriate.
|
||||
* **CA approval policy** This is a pluggable policy object that can automatically approve CA signing requests. Stock policies will include `always-reject`, `queue` and `insecure-always-approve`. With `queue` there would be an API for evaluating and accepting/rejecting requests. Cloud providers could implement a policy here that verifies other out of band information and automatically approves/rejects based on other external factors. A minimal interface sketch for such a policy follows this list.
|
||||
* **Scoped Kubelet Accounts** These accounts are per-node and (optionally) give a node permission to register itself.
|
||||
* To start with, we'd have the kubelets generate a cert/account in the form of `kubelet:<host>`, and we would hard-code policy to give that particular account the appropriate permissions. Over time, we can make the policy engine more generic.
|
||||
* [optional] **Bootstrap API endpoint** This is a helper service hosted outside of the Kubernetes cluster that helps with initial discovery of the master.
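A minimal sketch of what such a pluggable approval policy could look like (the package, interface, and type names here are hypothetical, not an existing Kubernetes API):

{% highlight go %}
{% raw %}
package approval

// SigningRequest is a hypothetical, simplified stand-in for a certificate
// signing request submitted by a kubelet.
type SigningRequest struct {
	NodeName string
	CSRPEM   []byte
}

// Policy is a hypothetical interface for the pluggable CA approval policy.
type Policy interface {
	// Approve reports whether the request should be signed automatically.
	Approve(req SigningRequest) (bool, error)
}

// alwaysReject corresponds to the stock `always-reject` policy.
type alwaysReject struct{}

func (alwaysReject) Approve(SigningRequest) (bool, error) { return false, nil }

// insecureAlwaysApprove corresponds to the stock `insecure-always-approve` policy.
type insecureAlwaysApprove struct{}

func (insecureAlwaysApprove) Approve(SigningRequest) (bool, error) { return true, nil }

// A `queue` policy would instead park requests and expose an API for an admin
// (or automated checker) to accept or reject them.
{% endraw %}
{% endhighlight %}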
|
||||
|
||||
### Static Clustering
|
||||
|
||||
In this sequence diagram, an out-of-band admin entity creates all certificates and distributes them. It also makes sure that the kubelets know where to find the master. This provides a lot of control but is more difficult to set up, as lots of information must be communicated outside of Kubernetes.
|
||||
|
||||
![Static Sequence Diagram](clustering/static.png)
|
||||
|
||||
### Dynamic Clustering
|
||||
|
||||
This diagram shows dynamic clustering using the bootstrap API endpoint. That API endpoint is used to both find the location of the master and communicate the root CA for the master.
|
||||
|
||||
This flow has the admin manually approving the kubelet signing requests. This is the `queue` policy defined above. This manual intervention could be replaced by code that can verify the signing requests via other means.
|
||||
|
||||
![Dynamic Sequence Diagram](clustering/dynamic.png)
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/clustering.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1 @@
|
|||
DroidSansMono.ttf
|
|
@ -0,0 +1,12 @@
|
|||
FROM debian:jessie
|
||||
|
||||
RUN apt-get update
|
||||
RUN apt-get -qy install python-seqdiag make curl
|
||||
|
||||
WORKDIR /diagrams
|
||||
|
||||
RUN curl -sLo DroidSansMono.ttf https://googlefontdirectory.googlecode.com/hg/apache/droidsansmono/DroidSansMono.ttf
|
||||
|
||||
ADD . /diagrams
|
||||
|
||||
CMD bash -c 'make >/dev/stderr && tar cf - *.png'
|
|
@ -0,0 +1,29 @@
|
|||
FONT := DroidSansMono.ttf
|
||||
|
||||
PNGS := $(patsubst %.seqdiag,%.png,$(wildcard *.seqdiag))
|
||||
|
||||
.PHONY: all
|
||||
all: $(PNGS)
|
||||
|
||||
.PHONY: watch
|
||||
watch:
|
||||
fswatch *.seqdiag | xargs -n 1 sh -c "make || true"
|
||||
|
||||
$(FONT):
|
||||
curl -sLo $@ https://googlefontdirectory.googlecode.com/hg/apache/droidsansmono/$(FONT)
|
||||
|
||||
%.png: %.seqdiag $(FONT)
|
||||
seqdiag --no-transparency -a -f '$(FONT)' $<
|
||||
|
||||
# Build the stuff via a docker image
|
||||
.PHONY: docker
|
||||
docker:
|
||||
docker build -t clustering-seqdiag .
|
||||
docker run --rm clustering-seqdiag | tar xvf -
|
||||
|
||||
docker-clean:
|
||||
docker rmi clustering-seqdiag || true
|
||||
docker images -q --filter "dangling=true" | xargs docker rmi
|
||||
|
||||
fix-clock-skew:
|
||||
boot2docker ssh sudo date -u -D "%Y%m%d%H%M.%S" --set "$(shell date -u +%Y%m%d%H%M.%S)"
|
|
@ -0,0 +1,52 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "Building with Docker"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
This directory contains diagrams for the clustering design doc.
|
||||
|
||||
This depends on the `seqdiag` [utility](http://blockdiag.com/en/seqdiag/index.html). Assuming you have a non-borked python install, this should be installable with
|
||||
|
||||
{% highlight sh %}
|
||||
{% raw %}
|
||||
pip install seqdiag
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
Just call `make` to regenerate the diagrams.
|
||||
|
||||
## Building with Docker
|
||||
|
||||
If you are on a Mac or your pip install is messed up, you can easily build with docker.
|
||||
|
||||
{% highlight sh %}
|
||||
{% raw %}
|
||||
make docker
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
The first run will be slow but things should be fast after that.
|
||||
|
||||
To clean up the docker containers that are created (and other cruft that is left around) you can run `make docker-clean`.
|
||||
|
||||
If you are using boot2docker and get warnings about clock skew (or if things aren't building for some reason) then you can fix that up with `make fix-clock-skew`.
|
||||
|
||||
## Automatically rebuild on file changes
|
||||
|
||||
If you have the fswatch utility installed, you can have it monitor the file system and automatically rebuild when files have changed. Just do a `make watch`.
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/clustering/README.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
After Width: | Height: | Size: 71 KiB |
|
@ -0,0 +1,24 @@
|
|||
seqdiag {
|
||||
activation = none;
|
||||
|
||||
|
||||
user[label = "Admin User"];
|
||||
bootstrap[label = "Bootstrap API\nEndpoint"];
|
||||
master;
|
||||
kubelet[stacked];
|
||||
|
||||
user -> bootstrap [label="createCluster", return="cluster ID"];
|
||||
user <-- bootstrap [label="returns\n- bootstrap-cluster-uri"];
|
||||
|
||||
user ->> master [label="start\n- bootstrap-cluster-uri"];
|
||||
master => bootstrap [label="setMaster\n- master-location\n- master-ca"];
|
||||
|
||||
user ->> kubelet [label="start\n- bootstrap-cluster-uri"];
|
||||
kubelet => bootstrap [label="get-master", return="returns\n- master-location\n- master-ca"];
|
||||
kubelet ->> master [label="signCert\n- unsigned-kubelet-cert", return="returns\n- kubelet-cert"];
|
||||
user => master [label="getSignRequests"];
|
||||
user => master [label="approveSignRequests"];
|
||||
kubelet <<-- master [label="returns\n- kubelet-cert"];
|
||||
|
||||
kubelet => master [label="register\n- kubelet-location"]
|
||||
}
|
|
@ -0,0 +1,52 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "Building with Docker"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
This directory contains diagrams for the clustering design doc.
|
||||
|
||||
This depends on the `seqdiag` [utility](http://blockdiag.com/en/seqdiag/index.html). Assuming you have a non-borked python install, this should be installable with
|
||||
|
||||
{% highlight sh %}
|
||||
{% raw %}
|
||||
pip install seqdiag
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
Just call `make` to regenerate the diagrams.
|
||||
|
||||
## Building with Docker
|
||||
|
||||
If you are on a Mac or your pip install is messed up, you can easily build with docker.
|
||||
|
||||
{% highlight sh %}
|
||||
{% raw %}
|
||||
make docker
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
The first run will be slow but things should be fast after that.
|
||||
|
||||
To clean up the docker containers that are created (and other cruft that is left around) you can run `make docker-clean`.
|
||||
|
||||
If you are using boot2docker and get warnings about clock skew (or if things aren't building for some reason) then you can fix that up with `make fix-clock-skew`.
|
||||
|
||||
## Automatically rebuild on file changes
|
||||
|
||||
If you have the fswatch utility installed, you can have it monitor the file system and automatically rebuild when files have changed. Just do a `make watch`.
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/clustering/README.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
After Width: | Height: | Size: 36 KiB |
|
@ -0,0 +1,16 @@
|
|||
seqdiag {
|
||||
activation = none;
|
||||
|
||||
admin[label = "Manual Admin"];
|
||||
ca[label = "Manual CA"]
|
||||
master;
|
||||
kubelet[stacked];
|
||||
|
||||
admin => ca [label="create\n- master-cert"];
|
||||
admin ->> master [label="start\n- ca-root\n- master-cert"];
|
||||
|
||||
admin => ca [label="create\n- kubelet-cert"];
|
||||
admin ->> kubelet [label="start\n- ca-root\n- kubelet-cert\n- master-location"];
|
||||
|
||||
kubelet => master [label="register\n- kubelet-location"];
|
||||
}
|
|
@ -0,0 +1,168 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "Container Command Execution & Port Forwarding in Kubernetes"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
# Container Command Execution & Port Forwarding in Kubernetes
|
||||
|
||||
## Abstract
|
||||
|
||||
This describes an approach for providing support for:
|
||||
|
||||
- executing commands in containers, with stdin/stdout/stderr streams attached
|
||||
- port forwarding to containers
|
||||
|
||||
## Background
|
||||
|
||||
There are several related issues/PRs:
|
||||
|
||||
- [Support attach](http://issue.k8s.io/1521)
|
||||
- [Real container ssh](http://issue.k8s.io/1513)
|
||||
- [Provide easy debug network access to services](http://issue.k8s.io/1863)
|
||||
- [OpenShift container command execution proposal](https://github.com/openshift/origin/pull/576)
|
||||
|
||||
## Motivation
|
||||
|
||||
Users and administrators are accustomed to being able to access their systems
|
||||
via SSH to run remote commands, get shell access, and do port forwarding.
|
||||
|
||||
Supporting SSH to containers in Kubernetes is a difficult task. You must
|
||||
specify a "user" and a hostname to make an SSH connection, and `sshd` requires
|
||||
real users (resolvable by NSS and PAM). Because a container belongs to a pod,
|
||||
and the pod belongs to a namespace, you need to specify namespace/pod/container
|
||||
to uniquely identify the target container. Unfortunately, a
|
||||
namespace/pod/container is not a real user as far as SSH is concerned. Also,
|
||||
most Linux systems limit user names to 32 characters, which is unlikely to be
|
||||
large enough to contain namespace/pod/container. We could devise some scheme to
|
||||
map each namespace/pod/container to a 32-character user name, adding entries to
|
||||
`/etc/passwd` (or LDAP, etc.) and keeping those entries fully in sync all the
|
||||
time. Alternatively, we could write custom NSS and PAM modules that allow the
|
||||
host to resolve a namespace/pod/container to a user without needing to keep
|
||||
files or LDAP in sync.
|
||||
|
||||
As an alternative to SSH, we are using a multiplexed streaming protocol that
|
||||
runs on top of HTTP. There are no requirements about users being real users,
|
||||
nor is there any limitation on user name length, as the protocol is under our
|
||||
control. The only downside is that standard tooling that expects to use SSH
|
||||
won't be able to work with this mechanism, unless adapters can be written.
|
||||
|
||||
## Constraints and Assumptions
|
||||
|
||||
- SSH support is not currently in scope
|
||||
- CGroup confinement is ultimately desired, but implementing that support is not currently in scope
|
||||
- SELinux confinement is ultimately desired, but implementing that support is not currently in scope
|
||||
|
||||
## Use Cases
|
||||
|
||||
- As a user of a Kubernetes cluster, I want to run arbitrary commands in a container, attaching my local stdin/stdout/stderr to the container
|
||||
- As a user of a Kubernetes cluster, I want to be able to connect to local ports on my computer and have them forwarded to ports in the container
|
||||
|
||||
## Process Flow
|
||||
|
||||
### Remote Command Execution Flow
|
||||
|
||||
1. The client connects to the Kubernetes Master to initiate a remote command execution
|
||||
request
|
||||
2. The Master proxies the request to the Kubelet where the container lives
|
||||
3. The Kubelet executes nsenter + the requested command and streams stdin/stdout/stderr back and forth between the client and the container
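As a rough, illustrative sketch of step 3 (not the actual kubelet code; the container PID lookup and the exact nsenter flags are assumptions here):

{% highlight go %}
{% raw %}
package main

import (
	"os"
	"os/exec"
	"strconv"
)

// runInContainer sketches how a command could be run inside a container's
// namespaces with nsenter. containerPID is the PID of the container's init
// process on the node; how it is looked up is out of scope for this sketch.
func runInContainer(containerPID int, command []string) error {
	args := append([]string{
		"-t", strconv.Itoa(containerPID), // target the container's namespaces
		"-m", "-u", "-i", "-n", "-p", // mount, UTS, IPC, network, PID namespaces
		"--",
	}, command...)
	cmd := exec.Command("nsenter", args...)
	// Stream stdin/stdout/stderr between the caller and the container process.
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

func main() {
	_ = runInContainer(12345, []string{"/bin/sh", "-c", "hostname"})
}
{% endraw %}
{% endhighlight %}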
|
||||
|
||||
### Port Forwarding Flow
|
||||
|
||||
1. The client connects to the Kubernetes Master to initiate a port forwarding
|
||||
request
|
||||
2. The Master proxies the request to the Kubelet where the container lives
|
||||
3. The client listens on each specified local port, awaiting local connections
|
||||
4. The client connects to one of the local listening ports
|
||||
5. The client notifies the Kubelet of the new connection
|
||||
6. The Kubelet executes nsenter + socat and streams data back and forth between the client and the port in the container
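To illustrate the client side of steps 3 and 4, here is a minimal sketch of listening on a local port and copying data both ways; `dialKubeletStream` is a hypothetical placeholder for opening a new multiplexed stream through the Master to the Kubelet:

{% highlight go %}
{% raw %}
package main

import (
	"fmt"
	"io"
	"log"
	"net"
)

// dialKubeletStream is a hypothetical placeholder: in the real flow this would
// open a new multiplexed stream to the Kubelet (via the Master) for the given
// container port.
func dialKubeletStream(containerPort int) (io.ReadWriteCloser, error) {
	return nil, fmt.Errorf("not implemented in this sketch")
}

// forwardLocalPort listens on a local port (step 3) and, for every local
// connection (step 4), notifies the Kubelet and copies bytes both ways.
func forwardLocalPort(localPort, containerPort int) error {
	ln, err := net.Listen("tcp", fmt.Sprintf("127.0.0.1:%d", localPort))
	if err != nil {
		return err
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			return err
		}
		go func(local net.Conn) {
			defer local.Close()
			remote, err := dialKubeletStream(containerPort)
			if err != nil {
				log.Println("could not open stream:", err)
				return
			}
			defer remote.Close()
			go io.Copy(remote, local) // local client -> container port
			io.Copy(local, remote)    // container port -> local client
		}(conn)
	}
}

func main() {
	log.Fatal(forwardLocalPort(8080, 80))
}
{% endraw %}
{% endhighlight %}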
|
||||
|
||||
|
||||
## Design Considerations
|
||||
|
||||
### Streaming Protocol
|
||||
|
||||
The current multiplexed streaming protocol used is SPDY. This is not the
|
||||
long-term desire, however. As soon as there is viable support for HTTP/2 in Go,
|
||||
we will switch to that.
|
||||
|
||||
### Master as First Level Proxy
|
||||
|
||||
Clients should not be allowed to communicate directly with the Kubelet for
|
||||
security reasons. Therefore, the Master is currently the only suggested entry
|
||||
point to be used for remote command execution and port forwarding. This is not
|
||||
necessarily desirable, as it means that all remote command execution and port
|
||||
forwarding traffic must travel through the Master, potentially impacting other
|
||||
API requests.
|
||||
|
||||
In the future, it might make more sense to retrieve an authorization token from
|
||||
the Master, and then use that token to initiate a remote command execution or
|
||||
port forwarding request with a load balanced proxy service dedicated to this
|
||||
functionality. This would keep the streaming traffic out of the Master.
|
||||
|
||||
### Kubelet as Backend Proxy
|
||||
|
||||
The kubelet is currently responsible for handling remote command execution and
|
||||
port forwarding requests. Just like with the Master described above, this means
|
||||
that all remote command execution and port forwarding streaming traffic must
|
||||
travel through the Kubelet, which could result in a degraded ability to service
|
||||
other requests.
|
||||
|
||||
In the future, it might make more sense to use a separate service on the node.
|
||||
|
||||
Alternatively, we could possibly inject a process into the container that only
|
||||
listens for a single request, expose that process's listening port on the node,
|
||||
and then issue a redirect to the client such that it would connect to the first
|
||||
level proxy, which would then proxy directly to the injected process's exposed
|
||||
port. This would minimize the amount of proxying that takes place.
|
||||
|
||||
### Scalability
|
||||
|
||||
There are at least 2 different ways to execute a command in a container:
|
||||
`docker exec` and `nsenter`. While `docker exec` might seem like an easier and
|
||||
more obvious choice, it has some drawbacks.
|
||||
|
||||
#### `docker exec`
|
||||
|
||||
We could expose `docker exec` (i.e. have Docker listen on an exposed TCP port
|
||||
on the node), but this would require proxying from the edge and securing the
|
||||
Docker API. `docker exec` calls go through the Docker daemon, meaning that all
|
||||
stdin/stdout/stderr traffic is proxied through the Daemon, adding an extra hop.
|
||||
Additionally, you can't isolate 1 malicious `docker exec` call from normal
|
||||
usage, meaning an attacker could initiate a denial of service or other attack
|
||||
and take down the Docker daemon, or the node itself.
|
||||
|
||||
We expect remote command execution and port forwarding requests to be long
|
||||
running and/or high bandwidth operations, and routing all the streaming data
|
||||
through the Docker daemon feels like a bottleneck we can avoid.
|
||||
|
||||
#### `nsenter`
|
||||
|
||||
The implementation currently uses `nsenter` to run commands in containers,
|
||||
joining the appropriate container namespaces. `nsenter` runs directly on the
|
||||
node and is not proxied through any single daemon process.
|
||||
|
||||
### Security
|
||||
|
||||
Authentication and authorization haven't specifically been tested yet with this
|
||||
functionality. We need to make sure that users are not allowed to execute
|
||||
remote commands or do port forwarding to containers they aren't allowed to
|
||||
access.
|
||||
|
||||
Additional work is required to ensure that multiple command execution or port forwarding connections from different clients are not able to see each other's data. This can most likely be achieved via SELinux labeling and unique process contexts.
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/command_execution_port_forwarding.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,145 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "DaemonSet in Kubernetes"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
# DaemonSet in Kubernetes
|
||||
|
||||
**Author**: Ananya Kumar (@AnanyaKumar)
|
||||
|
||||
**Status**: Implemented.
|
||||
|
||||
This document presents the design of the Kubernetes DaemonSet, describes use cases, and gives an overview of the code.
|
||||
|
||||
## Motivation
|
||||
|
||||
Many users have requested a way to run a daemon on every node in a Kubernetes cluster, or on a certain set of nodes in a cluster. This is essential for use cases such as building a sharded datastore, or running a logger on every node. In comes the DaemonSet, a way to conveniently create and manage daemon-like workloads in Kubernetes.
|
||||
|
||||
## Use Cases
|
||||
|
||||
The DaemonSet can be used for user-specified system services, cluster-level applications with strong node ties, and Kubernetes node services. Below are example use cases in each category.
|
||||
|
||||
### User-Specified System Services:
|
||||
|
||||
Logging: Some users want a way to collect statistics about nodes in a cluster and send those logs to an external database. For example, system administrators might want to know if their machines are performing as expected, if they need to add more machines to the cluster, or if they should switch cloud providers. The DaemonSet can be used to run a data collection service (for example fluentd) on every node and send the data to a service like ElasticSearch for analysis.
|
||||
|
||||
### Cluster-Level Applications
|
||||
|
||||
Datastore: Users might want to implement a sharded datastore in their cluster. A few nodes in the cluster, labeled ‘app=datastore’, might be responsible for storing data shards, and pods running on these nodes might serve data. This architecture requires a way to bind pods to specific nodes, so it cannot be achieved using a Replication Controller. A DaemonSet is a convenient way to implement such a datastore.
|
||||
|
||||
For other uses, see the related [feature request](https://issues.k8s.io/1518).
|
||||
|
||||
## Functionality
|
||||
|
||||
The DaemonSet supports standard API features:
|
||||
- create
|
||||
- The spec for DaemonSets has a pod template field.
|
||||
- Using the pod’s nodeSelector field, DaemonSets can be restricted to operate over nodes that have a certain label. For example, suppose that in a cluster some nodes are labeled ‘app=database’. You can use a DaemonSet to launch a datastore pod on exactly those nodes labeled ‘app=database’.
|
||||
- Using the pod's nodeName field, DaemonSets can be restricted to operate on a specified node.
|
||||
- The PodTemplateSpec used by the DaemonSet is the same as the PodTemplateSpec used by the Replication Controller.
|
||||
- The initial implementation will not guarantee that DaemonSet pods are created on nodes before other pods.
|
||||
- The initial implementation of DaemonSet does not guarantee that DaemonSet pods show up on nodes (for example because of resource limitations of the node), but makes a best effort to launch DaemonSet pods (like Replication Controllers do with pods). Subsequent revisions might ensure that DaemonSet pods show up on nodes, preempting other pods if necessary.
|
||||
- The DaemonSet controller adds an annotation "kubernetes.io/created-by: \<json API object reference\>"
|
||||
- YAML example:
|
||||
|
||||
{% highlight yaml %}
|
||||
{% raw %}
|
||||
apiVersion: v1
|
||||
kind: DaemonSet
|
||||
metadata:
|
||||
labels:
|
||||
app: datastore
|
||||
name: datastore
|
||||
spec:
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
app: datastore-shard
|
||||
spec:
|
||||
nodeSelector:
|
||||
app: datastore-node
|
||||
containers:
|
||||
- name: datastore-shard
|
||||
image: kubernetes/sharded
|
||||
ports:
|
||||
- containerPort: 9042
|
||||
name: main
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
- commands that get info
|
||||
- get (e.g. kubectl get daemonsets)
|
||||
- describe
|
||||
- Modifiers
|
||||
- delete (if --cascade=true, then first the client turns down all the pods controlled by the DaemonSet (by setting the nodeSelector to a uuid pair that is unlikely to be set on any node); then it deletes the DaemonSet; then it deletes the pods)
|
||||
- label
|
||||
- annotate
|
||||
- update operations like patch and replace (updates are only allowed to the selector and to the nodeSelector and nodeName of the pod template)
|
||||
- DaemonSets have labels, so you could, for example, list all DaemonSets with certain labels (the same way you would for a Replication Controller).
|
||||
- In general, for all the supported features like get, describe, update, etc, the DaemonSet works in a similar way to the Replication Controller. However, note that the DaemonSet and the Replication Controller are different constructs.
|
||||
|
||||
### Persisting Pods
|
||||
|
||||
- Ordinary liveness probes specified in the pod template work to keep pods created by a DaemonSet running.
|
||||
- If a daemon pod is killed or stopped, the DaemonSet will create a new replica of the daemon pod on the node.
|
||||
|
||||
### Cluster Mutations
|
||||
|
||||
- When a new node is added to the cluster, the DaemonSet controller starts daemon pods on the node for DaemonSets whose pod template nodeSelectors match the node’s labels.
|
||||
- Suppose the user launches a DaemonSet that runs a logging daemon on all nodes labeled “logger=fluentd”. If the user then adds the “logger=fluentd” label to a node (that did not initially have the label), the logging daemon will launch on the node. Additionally, if a user removes the label from a node, the logging daemon on that node will be killed.
|
||||
|
||||
## Alternatives Considered
|
||||
|
||||
We considered several alternatives that were deemed inferior to the approach of creating a new DaemonSet abstraction.
|
||||
|
||||
One alternative is to include the daemon in the machine image. In this case it would run outside of Kubernetes proper, and thus not be monitored, health checked, usable as a service endpoint, easily upgradable, etc.
|
||||
|
||||
A related alternative is to package daemons as static pods. This would address most of the problems described above, but they would still not be easily upgradable, and more generally could not be managed through the API server interface.
|
||||
|
||||
A third alternative is to generalize the Replication Controller. We would do something like: if you set the `replicas` field of the ReplicationControllerSpec to -1, then it means "run exactly one replica on every node matching the nodeSelector in the pod template." The ReplicationController would pretend `replicas` had been set to some large number -- larger than the largest number of nodes ever expected in the cluster -- and would use some anti-affinity mechanism to ensure that no more than one Pod from the ReplicationController runs on any given node. There are two downsides to this approach. First, there would always be a large number of Pending pods in the scheduler (these will be scheduled onto new machines when they are added to the cluster). The second downside is more philosophical: DaemonSet and the Replication Controller are very different concepts. We believe that having small, targeted controllers for distinct purposes makes Kubernetes easier to understand and use, compared to having larger multi-functional controllers (see ["Convert ReplicationController to a plugin"](http://issues.k8s.io/3058) for some discussion of this topic).
|
||||
|
||||
## Design
|
||||
|
||||
#### Client
|
||||
|
||||
- Add support for DaemonSet commands to kubectl and the client. Client code was added to client/unversioned. The main files in Kubectl that were modified are kubectl/describe.go and kubectl/stop.go, since for other calls like Get, Create, and Update, the client simply forwards the request to the backend via the REST API.
|
||||
|
||||
#### Apiserver
|
||||
|
||||
- Accept, parse, validate client commands
|
||||
- REST API calls are handled in registry/daemon
|
||||
- In particular, the api server will add the object to etcd
|
||||
- DaemonManager listens for updates to etcd (using Framework.informer)
|
||||
- API objects for DaemonSet were created in expapi/v1/types.go and expapi/v1/register.go
|
||||
- Validation code is in expapi/validation
|
||||
|
||||
#### Daemon Manager
|
||||
|
||||
- Creates new DaemonSets when requested. Launches the corresponding daemon pod on all nodes with labels matching the new DaemonSet’s selector.
|
||||
- Listens for addition of new nodes to the cluster, by setting up a framework.NewInformer that watches for the creation of Node API objects. When a new node is added, the daemon manager will loop through each DaemonSet. If the label of the node matches the selector of the DaemonSet, then the daemon manager will create the corresponding daemon pod in the new node.
|
||||
- The daemon manager creates a pod on a node by sending a command to the API server, requesting that a pod be bound to the node (the node will be specified via its hostname)
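A simplified sketch of this node-add handling (the types and helpers below are illustrative stand-ins, not the real controller or API types):

{% highlight go %}
{% raw %}
package main

import "fmt"

// Stand-in types for illustration only; not the real Kubernetes API objects.
type Node struct {
	Name   string
	Labels map[string]string
}

type DaemonSet struct {
	Name         string
	NodeSelector map[string]string
}

// selectorMatches reports whether every key/value in the selector is present
// in the node's labels.
func selectorMatches(selector, labels map[string]string) bool {
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// onNodeAdd sketches what the daemon manager does when a new node appears:
// launch a daemon pod on it for every DaemonSet whose selector matches.
func onNodeAdd(node Node, daemonSets []DaemonSet, createPod func(ds DaemonSet, nodeName string)) {
	for _, ds := range daemonSets {
		if selectorMatches(ds.NodeSelector, node.Labels) {
			createPod(ds, node.Name)
		}
	}
}

func main() {
	node := Node{Name: "node-7", Labels: map[string]string{"app": "datastore-node"}}
	sets := []DaemonSet{{Name: "datastore", NodeSelector: map[string]string{"app": "datastore-node"}}}
	onNodeAdd(node, sets, func(ds DaemonSet, nodeName string) {
		fmt.Printf("binding a %s pod to %s\n", ds.Name, nodeName)
	})
}
{% endraw %}
{% endhighlight %}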
|
||||
|
||||
#### Kubelet
|
||||
|
||||
- Does not need to be modified; health checking will occur for the daemon pods, and the kubelet will revive the pods if they are killed (we set the pod restartPolicy to Always). We reject DaemonSet objects with pod templates that don’t have restartPolicy set to Always.
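The restartPolicy rule amounts to a check like the following (an illustrative sketch, not the actual validation code):

{% highlight go %}
{% raw %}
package main

import "fmt"

// validateDaemonSetRestartPolicy sketches the validation rule: DaemonSet pod
// templates must use restartPolicy Always so that killed daemon pods are revived.
func validateDaemonSetRestartPolicy(restartPolicy string) error {
	if restartPolicy != "Always" {
		return fmt.Errorf("DaemonSet pod template must set restartPolicy to Always, got %q", restartPolicy)
	}
	return nil
}

func main() {
	fmt.Println(validateDaemonSetRestartPolicy("OnFailure"))
}
{% endraw %}
{% endhighlight %}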
|
||||
|
||||
## Open Issues
|
||||
|
||||
- Should work similarly to [Deployment](http://issues.k8s.io/1743).
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/daemon.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,107 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "Kubernetes Event Compression"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
# Kubernetes Event Compression
|
||||
|
||||
This document captures the design of event compression.
|
||||
|
||||
|
||||
## Background
|
||||
|
||||
Kubernetes components can get into a state where they generate tons of events which are identical except for the timestamp. For example, when pulling a non-existing image, Kubelet will repeatedly generate `image_not_existing` and `container_is_waiting` events until upstream components correct the image. When this happens, the spam from the repeated events makes the entire event mechanism useless. It also appears to cause memory pressure in etcd (see [#3853](http://issue.k8s.io/3853)).
|
||||
|
||||
## Proposal
|
||||
|
||||
Each binary that generates events (for example, `kubelet`) should keep track of previously generated events so that it can collapse recurring events into a single event instead of creating a new instance for each new event.
|
||||
|
||||
Event compression should be best effort (not guaranteed). Meaning, in the worst case, `n` identical (minus timestamp) events may still result in `n` event entries.
|
||||
|
||||
## Design
|
||||
|
||||
Instead of a single Timestamp, each event object [contains](http://releases.k8s.io/release-1.1/pkg/api/types.go#L1111) the following fields:
|
||||
* `FirstTimestamp unversioned.Time`
|
||||
* The date/time of the first occurrence of the event.
|
||||
* `LastTimestamp unversioned.Time`
|
||||
* The date/time of the most recent occurrence of the event.
|
||||
* On first occurrence, this is equal to the FirstTimestamp.
|
||||
* `Count int`
|
||||
* The number of occurrences of this event between FirstTimestamp and LastTimestamp
|
||||
* On first occurrence, this is 1.
|
||||
|
||||
Each binary that generates events:
|
||||
* Maintains a historical record of previously generated events:
|
||||
* Implemented with ["Least Recently Used Cache"](https://github.com/golang/groupcache/blob/master/lru/lru.go) in [`pkg/client/record/events_cache.go`](https://releases.k8s.io/release-1.1/pkg/client/record/events_cache.go).
|
||||
* The key in the cache is generated from the event object minus timestamps/count/transient fields; specifically, the following event fields are used to construct a unique key for an event (see the sketch after this list):
|
||||
* `event.Source.Component`
|
||||
* `event.Source.Host`
|
||||
* `event.InvolvedObject.Kind`
|
||||
* `event.InvolvedObject.Namespace`
|
||||
* `event.InvolvedObject.Name`
|
||||
* `event.InvolvedObject.UID`
|
||||
* `event.InvolvedObject.APIVersion`
|
||||
* `event.Reason`
|
||||
* `event.Message`
|
||||
* The LRU cache is capped at 4096 events. That means if a component (e.g. kubelet) runs for a long period of time and generates tons of unique events, the previously generated events cache will not grow unchecked in memory. Instead, after 4096 unique events are generated, the oldest events are evicted from the cache.
|
||||
* When an event is generated, the previously generated events cache is checked (see [`pkg/client/unversioned/record/event.go`](http://releases.k8s.io/release-1.1/pkg/client/unversioned/record/event.go)).
|
||||
* If the key for the new event matches the key for a previously generated event (meaning all of the above fields match between the new event and some previously generated event), then the event is considered to be a duplicate and the existing event entry is updated in etcd:
|
||||
* The new PUT (update) event API is called to update the existing event entry in etcd with the new last seen timestamp and count.
|
||||
* The event is also updated in the previously generated events cache with an incremented count, updated last seen timestamp, name, and new resource version (all required to issue a future event update).
|
||||
* If the key for the new event does not match the key for any previously generated event (meaning none of the above fields match between the new event and any previously generated events), then the event is considered to be new/unique and a new event entry is created in etcd:
|
||||
* The usual POST/create event API is called to create a new event entry in etcd.
|
||||
* An entry for the event is also added to the previously generated events cache.
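A minimal sketch of the key construction described in the list above (the field and type names are simplified stand-ins for the real event object):

{% highlight go %}
{% raw %}
package main

import (
	"fmt"
	"strings"
)

// event is a simplified stand-in for the real API event object, carrying only
// the fields that participate in the aggregation key.
type event struct {
	SourceComponent, SourceHost                                                    string
	InvolvedKind, InvolvedNamespace, InvolvedName, InvolvedUID, InvolvedAPIVersion string
	Reason, Message                                                                string
}

// eventKey joins the identifying fields; timestamps and count are deliberately
// excluded so that recurring events map to the same cache entry.
func eventKey(e event) string {
	return strings.Join([]string{
		e.SourceComponent, e.SourceHost,
		e.InvolvedKind, e.InvolvedNamespace, e.InvolvedName, e.InvolvedUID, e.InvolvedAPIVersion,
		e.Reason, e.Message,
	}, "|")
}

func main() {
	a := event{SourceComponent: "kubelet", SourceHost: "node-1", InvolvedKind: "Pod",
		InvolvedNamespace: "default", InvolvedName: "web-1", Reason: "failedScheduling",
		Message: "Error scheduling: no nodes available to schedule pods"}
	b := a // a recurrence differs only in timestamps/count, which are excluded from the key
	fmt.Println(eventKey(a) == eventKey(b)) // true: the recurrence updates the existing entry
}
{% endraw %}
{% endhighlight %}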
|
||||
|
||||
## Issues/Risks
|
||||
|
||||
* Compression is not guaranteed, because each component keeps track of event history in memory
|
||||
* An application restart causes event history to be cleared, meaning event history is not preserved across application restarts and compression will not occur across component restarts.
|
||||
* Because an LRU cache is used to keep track of previously generated events, if too many unique events are generated, old events will be evicted from the cache, so events will only be compressed until they age out of the events cache, at which point any new instance of the event will cause a new entry to be created in etcd.
|
||||
|
||||
## Example
|
||||
|
||||
Sample kubectl output
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
FIRSTSEEN LASTSEEN COUNT NAME KIND SUBOBJECT REASON SOURCE MESSAGE
|
||||
Thu, 12 Feb 2015 01:13:02 +0000 Thu, 12 Feb 2015 01:13:02 +0000 1 kubernetes-minion-4.c.saad-dev-vms.internal Minion starting {kubelet kubernetes-minion-4.c.saad-dev-vms.internal} Starting kubelet.
|
||||
Thu, 12 Feb 2015 01:13:09 +0000 Thu, 12 Feb 2015 01:13:09 +0000 1 kubernetes-minion-1.c.saad-dev-vms.internal Minion starting {kubelet kubernetes-minion-1.c.saad-dev-vms.internal} Starting kubelet.
|
||||
Thu, 12 Feb 2015 01:13:09 +0000 Thu, 12 Feb 2015 01:13:09 +0000 1 kubernetes-minion-3.c.saad-dev-vms.internal Minion starting {kubelet kubernetes-minion-3.c.saad-dev-vms.internal} Starting kubelet.
|
||||
Thu, 12 Feb 2015 01:13:09 +0000 Thu, 12 Feb 2015 01:13:09 +0000 1 kubernetes-minion-2.c.saad-dev-vms.internal Minion starting {kubelet kubernetes-minion-2.c.saad-dev-vms.internal} Starting kubelet.
|
||||
Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 monitoring-influx-grafana-controller-0133o Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods
|
||||
Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 elasticsearch-logging-controller-fplln Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods
|
||||
Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 kibana-logging-controller-gziey Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods
|
||||
Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 skydns-ls6k1 Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods
|
||||
Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 monitoring-heapster-controller-oh43e Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods
|
||||
Thu, 12 Feb 2015 01:13:20 +0000 Thu, 12 Feb 2015 01:13:20 +0000 1 kibana-logging-controller-gziey BoundPod implicitly required container POD pulled {kubelet kubernetes-minion-4.c.saad-dev-vms.internal} Successfully pulled image "kubernetes/pause:latest"
|
||||
Thu, 12 Feb 2015 01:13:20 +0000 Thu, 12 Feb 2015 01:13:20 +0000 1 kibana-logging-controller-gziey Pod scheduled {scheduler } Successfully assigned kibana-logging-controller-gziey to kubernetes-minion-4.c.saad-dev-vms.internal
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
This demonstrates what would have been 20 separate entries (indicating scheduling failure) collapsed/compressed down to 5 entries.
|
||||
|
||||
## Related Pull Requests/Issues
|
||||
|
||||
* Issue [#4073](http://issue.k8s.io/4073): Compress duplicate events
|
||||
* PR [#4157](http://issue.k8s.io/4157): Add "Update Event" to Kubernetes API
|
||||
* PR [#4206](http://issue.k8s.io/4206): Modify Event struct to allow compressing multiple recurring events in to a single event
|
||||
* PR [#4306](http://issue.k8s.io/4306): Compress recurring events in to a single event to optimize etcd storage
|
||||
* PR [#4444](http://pr.k8s.io/4444): Switch events history to use LRU cache instead of map
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/event_compression.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,420 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "Variable expansion in pod command, args, and env"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
# Variable expansion in pod command, args, and env
|
||||
|
||||
## Abstract
|
||||
|
||||
A proposal for the expansion of environment variables using a simple `$(var)` syntax.
|
||||
|
||||
## Motivation
|
||||
|
||||
It is extremely common for users to need to compose environment variables or pass arguments to
|
||||
their commands using the values of environment variables. Kubernetes should provide a facility for
|
||||
the 80% cases in order to decrease coupling and the use of workarounds.
|
||||
|
||||
## Goals
|
||||
|
||||
1. Define the syntax format
|
||||
2. Define the scoping and ordering of substitutions
|
||||
3. Define the behavior for unmatched variables
|
||||
4. Define the behavior for unexpected/malformed input
|
||||
|
||||
## Constraints and Assumptions
|
||||
|
||||
* This design should describe the simplest possible syntax to accomplish the use-cases
|
||||
* Expansion syntax will not support more complicated shell-like behaviors such as default values
|
||||
(viz: `$(VARIABLE_NAME:"default")`), inline substitution, etc.
|
||||
|
||||
## Use Cases
|
||||
|
||||
1. As a user, I want to compose new environment variables for a container using a substitution
|
||||
syntax to reference other variables in the container's environment and service environment
|
||||
variables
|
||||
1. As a user, I want to substitute environment variables into a container's command
|
||||
1. As a user, I want to do the above without requiring the container's image to have a shell
|
||||
1. As a user, I want to be able to specify a default value for a service variable which may
|
||||
not exist
|
||||
1. As a user, I want to see an event associated with the pod if an expansion fails (ie, references
|
||||
variable names that cannot be expanded)
|
||||
|
||||
### Use Case: Composition of environment variables
|
||||
|
||||
Currently, containers are injected with docker-style environment variables for the services in
|
||||
their pod's namespace. There are several variables for each service, but users routinely need
|
||||
to compose URLs based on these variables because there is not a variable for the exact format
|
||||
they need. Users should be able to build new environment variables with the exact format they need.
|
||||
Eventually, it should also be possible to turn off the automatic injection of the docker-style
|
||||
variables into pods and let the users consume the exact information they need via the downward API
|
||||
and composition.
|
||||
|
||||
#### Expanding expanded variables
|
||||
|
||||
It should be possible to reference a variable which is itself the result of an expansion, if the
|
||||
referenced variable is declared in the container's environment prior to the one referencing it.
|
||||
Put another way -- a container's environment is expanded in order, and expanded variables are
|
||||
available to subsequent expansions.
|
||||
|
||||
### Use Case: Variable expansion in command
|
||||
|
||||
Users frequently need to pass the values of environment variables to a container's command.
|
||||
Currently, Kubernetes does not perform any expansion of variables. The workaround is to invoke a
|
||||
shell in the container's command and have the shell perform the substitution, or to write a wrapper
|
||||
script that sets up the environment and runs the command. This has a number of drawbacks:
|
||||
|
||||
1. Solutions that require a shell are unfriendly to images that do not contain a shell
|
||||
2. Wrapper scripts make it harder to use images as base images
|
||||
3. Wrapper scripts increase coupling to Kubernetes
|
||||
|
||||
Users should be able to do the 80% case of variable expansion in command without writing a wrapper
|
||||
script or adding a shell invocation to their containers' commands.
|
||||
|
||||
### Use Case: Images without shells
|
||||
|
||||
The current workaround for variable expansion in a container's command requires the container's
|
||||
image to have a shell. This is unfriendly to images that do not contain a shell (`scratch` images,
|
||||
for example). Users should be able to perform the other use-cases in this design without regard to
|
||||
the content of their images.
|
||||
|
||||
### Use Case: See an event for incomplete expansions
|
||||
|
||||
It is possible that a container with incorrect variable values or command line may continue to run
|
||||
for a long period of time, and that the end-user would have no visual or obvious warning of the
|
||||
incorrect configuration. If the kubelet creates an event when an expansion references a variable
|
||||
that cannot be expanded, it will help users quickly detect problems with expansions.
|
||||
|
||||
## Design Considerations
|
||||
|
||||
### What features should be supported?
|
||||
|
||||
In order to limit complexity, we want to provide the right amount of functionality so that the 80%
|
||||
cases can be realized and nothing more. We felt that the essentials boiled down to:
|
||||
|
||||
1. Ability to perform direct expansion of variables in a string
|
||||
2. Ability to specify default values via a prioritized mapping function but without support for
|
||||
defaults as a syntax-level feature
|
||||
|
||||
### What should the syntax be?
|
||||
|
||||
The exact syntax for variable expansion has a large impact on how users perceive and relate to the
|
||||
feature. We considered implementing a very restrictive subset of the shell `${var}` syntax. This
|
||||
syntax is an attractive option on some level, because many people are familiar with it. However,
|
||||
this syntax also has a large number of lesser known features such as the ability to provide
|
||||
default values for unset variables, perform inline substitution, etc.
|
||||
|
||||
In the interest of preventing conflation of the expansion feature in Kubernetes with the shell
|
||||
feature, we chose a different syntax similar to the one in Makefiles, `$(var)`. We also chose not
|
||||
to support the bare `$var` format, since it is not required to implement the required use-cases.
|
||||
|
||||
Nested references, ie, variable expansion within variable names, are not supported.
|
||||
|
||||
#### How should unmatched references be treated?
|
||||
|
||||
Ideally, it should be extremely clear when a variable reference couldn't be expanded. We decided
|
||||
the best experience for unmatched variable references would be to have the entire reference, syntax
|
||||
included, show up in the output. As an example, if the reference `$(VARIABLE_NAME)` cannot be
|
||||
expanded, then `$(VARIABLE_NAME)` should be present in the output.
|
||||
|
||||
#### Escaping the operator
|
||||
|
||||
Although the `$(var)` syntax does overlap with the `$(command)` form of command substitution
|
||||
supported by many shells, because unexpanded variables are present verbatim in the output, we
|
||||
expect this will not present a problem to many users. If there is a collision between a variable
|
||||
name and command substitution syntax, the syntax can be escaped with the form `$$(VARIABLE_NAME)`,
|
||||
which will evaluate to `$(VARIABLE_NAME)` whether `VARIABLE_NAME` can be expanded or not.
|
||||
|
||||
## Design
|
||||
|
||||
This design encompasses the variable expansion syntax and specification and the changes needed to
|
||||
incorporate the expansion feature into the container's environment and command.
|
||||
|
||||
### Syntax and expansion mechanics
|
||||
|
||||
This section describes the expansion syntax, evaluation of variable values, and how unexpected or
|
||||
malformed inputs are handled.
|
||||
|
||||
#### Syntax
|
||||
|
||||
The inputs to the expansion feature are:
|
||||
|
||||
1. A utf-8 string (the input string) which may contain variable references
|
||||
2. A function (the mapping function) that maps the name of a variable to the variable's value, of
|
||||
type `func(string) string`
|
||||
|
||||
Variable references in the input string are indicated exclusively with the syntax
|
||||
`$(<variable-name>)`. The syntax tokens are:
|
||||
|
||||
- `$`: the operator
|
||||
- `(`: the reference opener
|
||||
- `)`: the reference closer
|
||||
|
||||
The operator has no meaning unless accompanied by the reference opener and closer tokens. The
|
||||
operator can be escaped using `$$`. One literal `$` will be emitted for each `$$` in the input.
|
||||
|
||||
The reference opener and closer characters have no meaning when not part of a variable reference.
|
||||
If a variable reference is malformed, viz: `$(VARIABLE_NAME` without a closing expression, the
|
||||
operator and expression opening characters are treated as ordinary characters without special
|
||||
meanings.
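To make these rules concrete, here is a small, illustrative implementation; it is not the actual `third_party/golang/expansion` code, and it relies on the mapping function to return unmatched references wrapped in the `$(...)` syntax, as specified later in this document:

{% highlight go %}
{% raw %}
package main

import (
	"fmt"
	"strings"
)

// expand is an illustrative implementation of the syntax rules above, not the
// real expansion package. mapping resolves a variable name to its value.
func expand(input string, mapping func(string) string) string {
	var out []byte
	for i := 0; i < len(input); i++ {
		if input[i] != '$' {
			out = append(out, input[i])
			continue
		}
		// "$$" escapes the operator: emit one literal '$'.
		if i+1 < len(input) && input[i+1] == '$' {
			out = append(out, '$')
			i++
			continue
		}
		// "$(name)" is a variable reference; hand the name to the mapping function.
		if i+1 < len(input) && input[i+1] == '(' {
			if end := strings.IndexByte(input[i+2:], ')'); end >= 0 {
				out = append(out, mapping(input[i+2:i+2+end])...)
				i += 2 + end
				continue
			}
		}
		// A lone '$' or a malformed reference is treated as ordinary text.
		out = append(out, input[i])
	}
	return string(out)
}

func main() {
	vars := map[string]string{"VAR_A": "A"}
	mapping := func(name string) string {
		if v, ok := vars[name]; ok {
			return v
		}
		return "$(" + name + ")" // unmatched references show up verbatim
	}
	fmt.Println(expand("$(VAR_A)_$$(VAR_A)_$(VAR_DNE)", mapping)) // A_$(VAR_A)_$(VAR_DNE)
}
{% endraw %}
{% endhighlight %}

This sketch only aims to show the shape of the algorithm; it does not claim to reproduce every corner case in the examples table below.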
|
||||
|
||||
#### Scope and ordering of substitutions
|
||||
|
||||
The scope in which variable references are expanded is defined by the mapping function. Within the
|
||||
mapping function, any arbitrary strategy may be used to determine the value of a variable name.
|
||||
The most basic implementation of a mapping function is to use a `map[string]string` to lookup the
|
||||
value of a variable.
|
||||
|
||||
In order to support default values for variables like service variables presented by the kubelet,
|
||||
which may not be bound because the service that provides them does not yet exist, there should be a
|
||||
mapping function that uses a list of `map[string]string` like:
|
||||
|
||||
{% highlight go %}
|
||||
{% raw %}
|
||||
func MakeMappingFunc(maps ...map[string]string) func(string) string {
|
||||
return func(input string) string {
|
||||
for _, context := range maps {
|
||||
val, ok := context[input]
|
||||
if ok {
|
||||
return val
|
||||
}
|
||||
}
|
||||
|
||||
return ""
|
||||
}
|
||||
}
|
||||
|
||||
// elsewhere
|
||||
containerEnv := map[string]string{
|
||||
"FOO": "BAR",
|
||||
"ZOO": "ZAB",
|
||||
"SERVICE2_HOST": "some-host",
|
||||
}
|
||||
|
||||
serviceEnv := map[string]string{
|
||||
"SERVICE_HOST": "another-host",
|
||||
"SERVICE_PORT": "8083",
|
||||
}
|
||||
|
||||
// single-map variation
|
||||
mapping := MakeMappingFunc(containerEnv)
|
||||
|
||||
// default variables not found in serviceEnv
|
||||
mappingWithDefaults := MakeMappingFunc(serviceEnv, containerEnv)
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
### Implementation changes
|
||||
|
||||
The necessary changes to implement this functionality are:
|
||||
|
||||
1. Add a new interface, `ObjectEventRecorder`, which is like the `EventRecorder` interface, but
|
||||
scoped to a single object, and a function that returns an `ObjectEventRecorder` given an
|
||||
`ObjectReference` and an `EventRecorder`
|
||||
2. Introduce `third_party/golang/expansion` package that provides:
|
||||
1. An `Expand(string, func(string) string) string` function
|
||||
2. A `MappingFuncFor(ObjectEventRecorder, ...map[string]string) func(string) string` function
|
||||
3. Make the kubelet expand environment correctly
|
||||
4. Make the kubelet expand command correctly
|
||||
|
||||
#### Event Recording
|
||||
|
||||
In order to provide an event when an expansion references undefined variables, the mapping function
|
||||
must be able to create an event. In order to facilitate this, we should create a new interface in
|
||||
the `api/client/record` package which is similar to `EventRecorder`, but scoped to a single object:
|
||||
|
||||
{% highlight go %}
|
||||
{% raw %}
|
||||
// ObjectEventRecorder knows how to record events about a single object.
|
||||
type ObjectEventRecorder interface {
|
||||
// Event constructs an event from the given information and puts it in the queue for sending.
|
||||
// 'reason' is the reason this event is generated. 'reason' should be short and unique; it will
|
||||
// be used to automate handling of events, so imagine people writing switch statements to
|
||||
// handle them. You want to make that easy.
|
||||
// 'message' is intended to be human readable.
|
||||
//
|
||||
// The resulting event will be created in the same namespace as the reference object.
|
||||
Event(reason, message string)
|
||||
|
||||
// Eventf is just like Event, but with Sprintf for the message field.
|
||||
Eventf(reason, messageFmt string, args ...interface{})
|
||||
|
||||
// PastEventf is just like Eventf, but with an option to specify the event's 'timestamp' field.
|
||||
PastEventf(timestamp unversioned.Time, reason, messageFmt string, args ...interface{})
|
||||
}
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
There should also be a function that can construct an `ObjectEventRecorder` from a `runtime.Object`
|
||||
and an `EventRecorder`:
|
||||
|
||||
{% highlight go %}
|
||||
{% raw %}
|
||||
type objectRecorderImpl struct {
|
||||
object runtime.Object
|
||||
recorder EventRecorder
|
||||
}
|
||||
|
||||
func (r *objectRecorderImpl) Event(reason, message string) {
|
||||
r.recorder.Event(r.object, reason, message)
|
||||
}
|
||||
|
||||
func ObjectEventRecorderFor(object runtime.Object, recorder EventRecorder) ObjectEventRecorder {
|
||||
return &objectRecorderImpl{object, recorder}
|
||||
}
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
#### Expansion package
|
||||
|
||||
The expansion package should provide two methods:
|
||||
|
||||
{% highlight go %}
|
||||
{% raw %}
|
||||
// MappingFuncFor returns a mapping function for use with Expand that
|
||||
// implements the expansion semantics defined in the expansion spec; it
|
||||
// returns the input string wrapped in the expansion syntax if no mapping
|
||||
// for the input is found. If no expansion is found for a key, an event
|
||||
// is raised on the given recorder.
|
||||
func MappingFuncFor(recorder record.ObjectEventRecorder, context ...map[string]string) func(string) string {
|
||||
// ...
|
||||
}
|
||||
|
||||
// Expand replaces variable references in the input string according to
|
||||
// the expansion spec using the given mapping function to resolve the
|
||||
// values of variables.
|
||||
func Expand(input string, mapping func(string) string) string {
|
||||
// ...
|
||||
}
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
#### Kubelet changes
|
||||
|
||||
The Kubelet should be made to correctly expand variables references in a container's environment,
|
||||
command, and args. Changes will need to be made to:
|
||||
|
||||
1. The `makeEnvironmentVariables` function in the kubelet; this is used by
|
||||
`GenerateRunContainerOptions`, which is used by both the docker and rkt container runtimes
|
||||
2. The docker manager `setEntrypointAndCommand` func has to be changed to perform variable
|
||||
expansion (see the sketch after this list)
|
||||
3. The rkt runtime should be made to support expansion in command and args when support for it is
|
||||
implemented
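
As referenced in item 2, a hedged sketch of what that expansion could look like for a container's command; the `container`, `envMap`, and `recorder` variables are assumptions, not existing kubelet code:

{% highlight go %}
{% raw %}
// Sketch only: expand each command element against the container's
// (already expanded) environment map.
mapping := expansion.MappingFuncFor(recorder, envMap)
command := make([]string, 0, len(container.Command))
for _, cmd := range container.Command {
	command = append(command, expansion.Expand(cmd, mapping))
}
{% endraw %}
{% endhighlight %}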
|
||||
|
||||
### Examples
|
||||
|
||||
#### Inputs and outputs
|
||||
|
||||
These examples are in the context of the mapping:
|
||||
|
||||
| Name | Value |
|
||||
|-------------|------------|
|
||||
| `VAR_A` | `"A"` |
|
||||
| `VAR_B` | `"B"` |
|
||||
| `VAR_C` | `"C"` |
|
||||
| `VAR_REF` | `$(VAR_A)` |
|
||||
| `VAR_EMPTY` | `""` |
|
||||
|
||||
No other variables are defined.
|
||||
|
||||
| Input | Result |
|
||||
|--------------------------------|----------------------------|
|
||||
| `"$(VAR_A)"` | `"A"` |
|
||||
| `"___$(VAR_B)___"` | `"___B___"` |
|
||||
| `"___$(VAR_C)"` | `"___C"` |
|
||||
| `"$(VAR_A)-$(VAR_A)"` | `"A-A"` |
|
||||
| `"$(VAR_A)-1"` | `"A-1"` |
|
||||
| `"$(VAR_A)_$(VAR_B)_$(VAR_C)"` | `"A_B_C"` |
|
||||
| `"$$(VAR_B)_$(VAR_A)"` | `"$(VAR_B)_A"` |
|
||||
| `"$$(VAR_A)_$$(VAR_B)"` | `"$(VAR_A)_$(VAR_B)"` |
|
||||
| `"f000-$$VAR_A"` | `"f000-$VAR_A"` |
|
||||
| `"foo\\$(VAR_C)bar"` | `"foo\Cbar"` |
|
||||
| `"foo\\\\$(VAR_C)bar"` | `"foo\\Cbar"` |
|
||||
| `"foo\\\\\\\\$(VAR_A)bar"` | `"foo\\\\Abar"` |
|
||||
| `"$(VAR_A$(VAR_B))"` | `"$(VAR_A$(VAR_B))"` |
|
||||
| `"$(VAR_A$(VAR_B)"` | `"$(VAR_A$(VAR_B)"` |
|
||||
| `"$(VAR_REF)"` | `"$(VAR_A)"` |
|
||||
| `"%%$(VAR_REF)--$(VAR_REF)%%"` | `"%%$(VAR_A)--$(VAR_A)%%"` |
|
||||
| `"foo$(VAR_EMPTY)bar"` | `"foobar"` |
|
||||
| `"foo$(VAR_Awhoops!"` | `"foo$(VAR_Awhoops!"` |
|
||||
| `"f00__(VAR_A)__"` | `"f00__(VAR_A)__"` |
|
||||
| `"$?_boo_$!"` | `"$?_boo_$!"` |
|
||||
| `"$VAR_A"` | `"$VAR_A"` |
|
||||
| `"$(VAR_DNE)"` | `"$(VAR_DNE)"` |
|
||||
| `"$$$$$$(BIG_MONEY)"` | `"$$$(BIG_MONEY)"` |
|
||||
| `"$$$$$$(VAR_A)"` | `"$$$(VAR_A)"` |
|
||||
| `"$$$$$$$(GOOD_ODDS)"` | `"$$$$(GOOD_ODDS)"` |
|
||||
| `"$$$$$$$(VAR_A)"` | `"$$$A"` |
|
||||
| `"$VAR_A)"` | `"$VAR_A)"` |
|
||||
| `"${VAR_A}"` | `"${VAR_A}"` |
|
||||
| `"$(VAR_B)_______$(A"` | `"B_______$(A"` |
|
||||
| `"$(VAR_C)_______$("` | `"C_______$("` |
|
||||
| `"$(VAR_A)foobarzab$"` | `"Afoobarzab$"` |
|
||||
| `"foo-\\$(VAR_A"` | `"foo-\$(VAR_A"` |
|
||||
| `"--$($($($($--"` | `"--$($($($($--"` |
|
||||
| `"$($($($($--foo$("` | `"$($($($($--foo$("` |
|
||||
| `"foo0--$($($($("` | `"foo0--$($($($("` |
|
||||
| `"$(foo$$var)` | `$(foo$$var)` |
|
||||
|
||||
#### In a pod: building a URL
|
||||
|
||||
Notice the `$(var)` syntax.
|
||||
|
||||
{% highlight yaml %}
|
||||
{% raw %}
|
||||
apiVersion: v1
|
||||
kind: Pod
|
||||
metadata:
|
||||
name: expansion-pod
|
||||
spec:
|
||||
containers:
|
||||
- name: test-container
|
||||
image: gcr.io/google_containers/busybox
|
||||
command: [ "/bin/sh", "-c", "env" ]
|
||||
env:
|
||||
- name: PUBLIC_URL
|
||||
value: "http://$(GITSERVER_SERVICE_HOST):$(GITSERVER_SERVICE_PORT)"
|
||||
restartPolicy: Never
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
#### In a pod: building a URL using downward API
|
||||
|
||||
{% highlight yaml %}
|
||||
{% raw %}
|
||||
apiVersion: v1
|
||||
kind: Pod
|
||||
metadata:
|
||||
name: expansion-pod
|
||||
spec:
|
||||
containers:
|
||||
- name: test-container
|
||||
image: gcr.io/google_containers/busybox
|
||||
command: [ "/bin/sh", "-c", "env" ]
|
||||
env:
|
||||
- name: POD_NAMESPACE
|
||||
valueFrom:
|
||||
fieldRef:
|
||||
fieldPath: "metadata.namespace"
|
||||
- name: PUBLIC_URL
|
||||
value: "http://gitserver.$(POD_NAMESPACE):$(SERVICE_PORT)"
|
||||
restartPolicy: Never
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/expansion.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,222 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "Adding custom resources to the Kubernetes API server"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
# Adding custom resources to the Kubernetes API server
|
||||
|
||||
This document describes the design for implementing the storage of custom API types in the Kubernetes API Server.
|
||||
|
||||
|
||||
## Resource Model
|
||||
|
||||
### The ThirdPartyResource
|
||||
|
||||
The `ThirdPartyResource` resource describes the multiple versions of a custom resource that the user wants to add
|
||||
to the Kubernetes API. `ThirdPartyResource` is a non-namespaced resource; attempting to place it in a namespace
|
||||
will return an error.
|
||||
|
||||
Each `ThirdPartyResource` resource has the following:
|
||||
* Standard Kubernetes object metadata.
|
||||
* ResourceKind - The kind of the resources described by this third party resource.
|
||||
* Description - A free text description of the resource.
|
||||
* APIGroup - An API group that this resource should be placed into.
|
||||
* Versions - One or more `Version` objects.
|
||||
|
||||
### The `Version` Object
|
||||
|
||||
The `Version` object describes a single concrete version of a custom resource. The `Version` object currently
|
||||
only specifies:
|
||||
* The `Name` of the version.
|
||||
* The `APIGroup` this version should belong to.
|
||||
|
||||
## Expectations about third party objects
|
||||
|
||||
Every object that is added to a third-party Kubernetes object store is expected to contain Kubernetes
|
||||
compatible [object metadata](../devel/api-conventions.html#metadata). This requirement enables the
|
||||
Kubernetes API server to provide the following features:
|
||||
* Filtering lists of objects via LabelQueries
|
||||
* `resourceVersion`-based optimistic concurrency via compare-and-swap
|
||||
* Versioned storage
|
||||
* Event recording
|
||||
* Integration with basic `kubectl` command line tooling.
|
||||
* Watch for resource changes.
|
||||
|
||||
The `Kind` for an instance of a third-party object (e.g. CronTab) below is expected to be
|
||||
programmatically convertible to the name of the resource using
|
||||
the following conversion. Kinds are expected to be of the form `<CamelCaseKind>`, and the
|
||||
`APIVersion` for the object is expected to be `<domain-name>/<api-group>/<api-version>`.
|
||||
|
||||
For example `example.com/stable/v1`
|
||||
|
||||
`domain-name` is expected to be a fully qualified domain name.
|
||||
|
||||
`CamelCaseKind` is the specific type name.
|
||||
|
||||
To convert this into the `metadata.name` for the `ThirdPartyResource` resource instance,
|
||||
the `<domain-name>` is copied verbatim and the `CamelCaseKind` is
|
||||
then converted
|
||||
to lower case, with a '-' inserted before each capital letter after the first
|
||||
('camel-case'). In pseudo code:
|
||||
|
||||
{% highlight go %}
|
||||
{% raw %}
|
||||
var result []byte
|
||||
for ix := range kindName {
|
||||
if ix != 0 && isCapital(kindName[ix]) {
|
||||
result = append(result, '-')
|
||||
}
|
||||
result = append(result, toLowerCase(kindName[ix]))
|
||||
}
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
As a concrete example, the resource named `camel-case-kind.example.com` defines resources of Kind `CamelCaseKind`, in
|
||||
the APIGroup with the prefix `example.com/...`.
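A runnable sketch of this conversion (illustrative only, not the actual Kubernetes implementation):

{% highlight go %}
{% raw %}
package main

import (
	"fmt"
	"unicode"
)

// kindToName converts a CamelCase kind (e.g. "CronTab") into the dashed,
// lower-case form used in the ThirdPartyResource name (e.g. "cron-tab").
func kindToName(kind string) string {
	var out []rune
	for i, r := range kind {
		if unicode.IsUpper(r) && i != 0 {
			out = append(out, '-')
		}
		out = append(out, unicode.ToLower(r))
	}
	return string(out)
}

func main() {
	fmt.Println(kindToName("CamelCaseKind") + ".example.com") // camel-case-kind.example.com
}
{% endraw %}
{% endhighlight %}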
|
||||
|
||||
The reason for this is to enable rapid lookup of a `ThirdPartyResource` object given the kind information.
|
||||
This is also the reason why `ThirdPartyResource` is not namespaced.
|
||||
|
||||
## Usage
|
||||
|
||||
When a user creates a new `ThirdPartyResource`, the Kubernetes API Server reacts by creating a new, namespaced
|
||||
RESTful resource path. For now, non-namespaced objects are not supported. As with existing built-in objects,
|
||||
deleting a namespace deletes all third-party resources in that namespace.
|
||||
|
||||
For example, if a user creates:
|
||||
|
||||
{% highlight yaml %}
|
||||
{% raw %}
|
||||
metadata:
|
||||
name: cron-tab.example.com
|
||||
apiVersion: extensions/v1beta1
|
||||
kind: ThirdPartyResource
|
||||
description: "A specification of a Pod to run on a cron style schedule"
|
||||
versions:
|
||||
- name: stable/v1
|
||||
- name: experimental/v2
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
Then the API server will program in two new RESTful resource paths:
|
||||
* `/thirdparty/example.com/stable/v1/namespaces/<namespace>/crontabs/...`
|
||||
* `/thirdparty/example.com/experimental/v2/namespaces/<namespace>/crontabs/...`
|
||||
|
||||
|
||||
Now that this schema has been created, a user can `POST`:
|
||||
|
||||
{% highlight json %}
|
||||
{% raw %}
|
||||
{
|
||||
"metadata": {
|
||||
"name": "my-new-cron-object"
|
||||
},
|
||||
"apiVersion": "example.com/stable/v1",
|
||||
"kind": "CronTab",
|
||||
"cronSpec": "* * * * /5",
|
||||
"image": "my-awesome-chron-image"
|
||||
}
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
to: `/thirdparty/example.com/stable/v1/namespaces/default/crontabs/my-new-cron-object`
|
||||
|
||||
and the corresponding data will be stored into etcd by the APIServer, so that when the user issues:
|
||||
|
||||
```
|
||||
{% raw %}
|
||||
GET /thirdparty/example.com/stable/v1/namespaces/default/crontabs/my-new-cron-object
|
||||
{% endraw %}
|
||||
```
|
||||
|
||||
And when they do that, they will get back the same data, but with additional Kubernetes metadata
|
||||
(e.g. `resourceVersion`, `creationTimestamp`) filled in.
|
||||
|
||||
Likewise, to list all resources, a user can issue:
|
||||
|
||||
```
|
||||
{% raw %}
|
||||
GET /thirdparty/example.com/stable/v1/namespaces/default/crontabs
|
||||
{% endraw %}
|
||||
```
|
||||
|
||||
and get back:
|
||||
|
||||
{% highlight json %}
|
||||
{% raw %}
|
||||
{
|
||||
"apiVersion": "example.com/stable/v1",
|
||||
"kind": "CronTabList",
|
||||
"items": [
|
||||
{
|
||||
"metadata": {
|
||||
"name": "my-new-cron-object"
|
||||
},
|
||||
"apiVersion": "example.com/stable/v1",
|
||||
"kind": "CronTab",
|
||||
"cronSpec": "* * * * /5",
|
||||
"image": "my-awesome-chron-image"
|
||||
}
|
||||
]
|
||||
}
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
Because all objects are expected to contain standard Kubernetes metadata fields, these
|
||||
list operations can also use `Label` queries to filter requests down to specific subsets.
|
||||
|
||||
Likewise, clients can use watch endpoints to watch for changes to stored objects.
|
||||
|
||||
|
||||
## Storage
|
||||
|
||||
In order to store custom user data in a versioned fashion inside of etcd, we need to also introduce a
|
||||
`Codec`-compatible object for persistent storage in etcd. This object is `ThirdPartyResourceData` and it contains:
|
||||
* Standard API Metadata
|
||||
* `Data`: The raw JSON data for this custom object.
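
A sketch of that shape, using the metadata and tag conventions seen elsewhere in this document (the exact source may differ):

{% highlight go %}
{% raw %}
// ThirdPartyResourceData is the Codec-compatible carrier for a custom object.
type ThirdPartyResourceData struct {
	unversioned.TypeMeta `json:",inline"`
	api.ObjectMeta       `json:"metadata,omitempty"`

	// Data is the raw JSON payload of the custom object.
	Data []byte `json:"data,omitempty"`
}
{% endraw %}
{% endhighlight %}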
|
||||
|
||||
### Storage key specification
|
||||
|
||||
Each custom object stored by the API server needs a custom key in storage; this key is described below:
|
||||
|
||||
#### Definitions
|
||||
|
||||
* `resource-namespace` : the namespace of the particular resource that is being stored
|
||||
* `resource-name`: the name of the particular resource being stored
|
||||
* `third-party-resource-namespace`: the namespace of the `ThirdPartyResource` resource that represents the type for the specific instance being stored.
|
||||
* `third-party-resource-name`: the name of the `ThirdPartyResource` resource that represents the type for the specific instance being stored.
|
||||
|
||||
#### Key
|
||||
|
||||
Given the definitions above, the key for a specific third-party object is:
|
||||
|
||||
```
|
||||
{% raw %}
|
||||
${standard-k8s-prefix}/third-party-resources/${third-party-resource-namespace}/${third-party-resource-name}/${resource-namespace}/${resource-name}
|
||||
{% endraw %}
|
||||
```
|
||||
|
||||
Thus, listing a third-party resource can be achieved by listing the directory:
|
||||
|
||||
```
|
||||
{% raw %}
|
||||
${standard-k8s-prefix}/third-party-resources/${third-party-resource-namespace}/${third-party-resource-name}/${resource-namespace}/
|
||||
{% endraw %}
|
||||
```
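
For illustration, such a key could be composed in Go as follows; the prefix and all names below are placeholder values, not actual cluster data:

{% highlight go %}
{% raw %}
package main

import (
	"fmt"
	"path"
)

func main() {
	key := path.Join(
		"/registry",            // standard k8s storage prefix (assumed)
		"third-party-resources",
		"default",              // third-party-resource-namespace
		"cron-tab.example.com", // third-party-resource-name
		"default",              // resource-namespace
		"my-new-cron-object",   // resource-name
	)
	fmt.Println(key)
	// /registry/third-party-resources/default/cron-tab.example.com/default/my-new-cron-object
}
{% endraw %}
{% endhighlight %}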
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/extending-api.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,264 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "Horizontal Pod Autoscaling"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
# Horizontal Pod Autoscaling
|
||||
|
||||
## Preface
|
||||
|
||||
This document briefly describes the design of the horizontal autoscaler for pods.
|
||||
The autoscaler (implemented as a Kubernetes API resource and controller) is responsible for dynamically controlling
|
||||
the number of replicas of some collection (e.g. the pods of a ReplicationController) to meet some objective(s),
|
||||
for example a target per-pod CPU utilization.
|
||||
|
||||
This design supersedes [autoscaling.md](http://releases.k8s.io/release-1.1/docs/proposals/autoscaling.md).
|
||||
|
||||
## Overview
|
||||
|
||||
The resource usage of a serving application usually varies over time: sometimes the demand for the application rises,
|
||||
and sometimes it drops.
|
||||
In Kubernetes version 1.0, a user can only manually set the number of serving pods.
|
||||
Our aim is to provide a mechanism for the automatic adjustment of the number of pods based on CPU utilization statistics
|
||||
(a future version will allow autoscaling based on other resources/metrics).
|
||||
|
||||
## Scale Subresource
|
||||
|
||||
In Kubernetes version 1.1, we are introducing the Scale subresource and implementing horizontal autoscaling of pods based on it.
|
||||
The Scale subresource is supported for replication controllers and deployments.
|
||||
The Scale subresource is a virtual resource (it does not correspond to an object stored in etcd).
|
||||
It is only present in the API as an interface that a controller (in this case the HorizontalPodAutoscaler) can use to dynamically scale
|
||||
the number of replicas controlled by some other API object (currently ReplicationController and Deployment) and to learn the current number of replicas.
|
||||
Scale is a subresource of the API object that it serves as the interface for.
|
||||
The Scale subresource is useful because whenever we introduce another type we want to autoscale, we just need to implement the Scale subresource for it.
|
||||
The wider discussion regarding Scale took place in [#1629](https://github.com/kubernetes/kubernetes/issues/1629).
|
||||
|
||||
The Scale subresource is exposed in the API for a replication controller or deployment under the following paths:
|
||||
|
||||
`apis/extensions/v1beta1/replicationcontrollers/myrc/scale`
|
||||
|
||||
`apis/extensions/v1beta1/deployments/mydeployment/scale`
|
||||
|
||||
It has the following structure:
|
||||
|
||||
{% highlight go %}
|
||||
{% raw %}
|
||||
// represents a scaling request for a resource.
|
||||
type Scale struct {
|
||||
unversioned.TypeMeta
|
||||
api.ObjectMeta
|
||||
|
||||
// defines the behavior of the scale.
|
||||
Spec ScaleSpec
|
||||
|
||||
// current status of the scale.
|
||||
Status ScaleStatus
|
||||
}
|
||||
|
||||
// describes the attributes of a scale subresource
|
||||
type ScaleSpec struct {
|
||||
// desired number of instances for the scaled object.
|
||||
Replicas int `json:"replicas,omitempty"`
|
||||
}
|
||||
|
||||
// represents the current status of a scale subresource.
|
||||
type ScaleStatus struct {
|
||||
// actual number of observed instances of the scaled object.
|
||||
Replicas int `json:"replicas"`
|
||||
|
||||
// label query over pods that should match the replicas count.
|
||||
Selector map[string]string `json:"selector,omitempty"`
|
||||
}
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
Writing to `ScaleSpec.Replicas` resizes the replication controller/deployment associated with
|
||||
the given Scale subresource.
|
||||
`ScaleStatus.Replicas` reports how many pods are currently running in the replication controller/deployment,
|
||||
and `ScaleStatus.Selector` returns the label selector for those pods.
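
A hedged sketch of how a controller might use the subresource; the `ScaleInterface` shown here is a hypothetical client shape for illustration, not the actual client library API:

{% highlight go %}
{% raw %}
// ScaleInterface is a hypothetical client used only for this sketch.
type ScaleInterface interface {
	Get(kind, name string) (*extensions.Scale, error)
	Update(kind string, scale *extensions.Scale) (*extensions.Scale, error)
}

// resize reads the current state through the Scale subresource and writes
// back the desired replica count.
func resize(c ScaleInterface, kind, name string, desired int) error {
	scale, err := c.Get(kind, name)
	if err != nil {
		return err
	}
	scale.Spec.Replicas = desired
	_, err = c.Update(kind, scale)
	return err
}
{% endraw %}
{% endhighlight %}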
|
||||
|
||||
## HorizontalPodAutoscaler Object
|
||||
|
||||
In Kubernetes version 1.1, we are introducing the HorizontalPodAutoscaler object. It is accessible under:
|
||||
|
||||
`apis/extensions/v1beta1/horizontalpodautoscalers/myautoscaler`
|
||||
|
||||
It has the following structure:
|
||||
|
||||
{% highlight go %}
|
||||
{% raw %}
|
||||
// configuration of a horizontal pod autoscaler.
|
||||
type HorizontalPodAutoscaler struct {
|
||||
unversioned.TypeMeta
|
||||
api.ObjectMeta
|
||||
|
||||
// behavior of autoscaler.
|
||||
Spec HorizontalPodAutoscalerSpec
|
||||
|
||||
// current information about the autoscaler.
|
||||
Status HorizontalPodAutoscalerStatus
|
||||
}
|
||||
|
||||
// specification of a horizontal pod autoscaler.
|
||||
type HorizontalPodAutoscalerSpec struct {
|
||||
// reference to Scale subresource; horizontal pod autoscaler will learn the current resource
|
||||
// consumption from its status, and will set the desired number of pods by modifying its spec.
|
||||
ScaleRef SubresourceReference
|
||||
// lower limit for the number of pods that can be set by the autoscaler, default 1.
|
||||
MinReplicas *int
|
||||
// upper limit for the number of pods that can be set by the autoscaler.
|
||||
// It cannot be smaller than MinReplicas.
|
||||
MaxReplicas int
|
||||
// target average CPU utilization (represented as a percentage of requested CPU) over all the pods;
|
||||
// if not specified, it defaults to a target CPU utilization of 80% of the requested CPU.
|
||||
CPUUtilization *CPUTargetUtilization
|
||||
}
|
||||
|
||||
type CPUTargetUtilization struct {
|
||||
// fraction of the requested CPU that should be utilized/used,
|
||||
// e.g. 70 means that 70% of the requested CPU should be in use.
|
||||
TargetPercentage int
|
||||
}
|
||||
|
||||
// current status of a horizontal pod autoscaler
|
||||
type HorizontalPodAutoscalerStatus struct {
|
||||
// most recent generation observed by this autoscaler.
|
||||
ObservedGeneration *int64
|
||||
|
||||
// last time the HorizontalPodAutoscaler scaled the number of pods;
|
||||
// used by the autoscaler to control how often the number of pods is changed.
|
||||
LastScaleTime *unversioned.Time
|
||||
|
||||
// current number of replicas of pods managed by this autoscaler.
|
||||
CurrentReplicas int
|
||||
|
||||
// desired number of replicas of pods managed by this autoscaler.
|
||||
DesiredReplicas int
|
||||
|
||||
// current average CPU utilization over all pods, represented as a percentage of requested CPU,
|
||||
// e.g. 70 means that an average pod is using now 70% of its requested CPU.
|
||||
CurrentCPUUtilizationPercentage *int
|
||||
}
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
`ScaleRef` is a reference to the Scale subresource.
|
||||
`MinReplicas`, `MaxReplicas` and `CPUUtilization` define autoscaler configuration.
|
||||
We are also introducing the HorizontalPodAutoscalerList object to enable listing all autoscalers in a namespace:
|
||||
|
||||
{% highlight go %}
|
||||
{% raw %}
|
||||
// list of horizontal pod autoscaler objects.
|
||||
type HorizontalPodAutoscalerList struct {
|
||||
unversioned.TypeMeta
|
||||
unversioned.ListMeta
|
||||
|
||||
// list of horizontal pod autoscaler objects.
|
||||
Items []HorizontalPodAutoscaler
|
||||
}
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
## Autoscaling Algorithm
|
||||
|
||||
The autoscaler is implemented as a control loop. It periodically queries the pods described by `Status.Selector` of the Scale subresource and collects their CPU utilization.
|
||||
Then, it compares the arithmetic mean of the pods' CPU utilization with the target defined in `Spec.CPUUtilization`,
|
||||
and adjusts the replicas of the Scale if needed to match the target
|
||||
(preserving condition: MinReplicas <= Replicas <= MaxReplicas).
|
||||
|
||||
The period of the autoscaler is controlled by the `--horizontal-pod-autoscaler-sync-period` flag of the controller manager.
|
||||
The default value is 30 seconds.
|
||||
|
||||
|
||||
CPU utilization is the recent CPU usage of a pod (average across the last 1 minute) divided by the CPU requested by the pod.
|
||||
In Kubernetes version 1.1, CPU usage is taken directly from Heapster.
|
||||
In the future, there will be an API on the master for this purpose
|
||||
(see [#11951](https://github.com/kubernetes/kubernetes/issues/11951)).
|
||||
|
||||
The target number of pods is calculated from the following formula:
|
||||
|
||||
```
|
||||
{% raw %}
|
||||
TargetNumOfPods = ceil(sum(CurrentPodsCPUUtilization) / Target)
|
||||
{% endraw %}
|
||||
```
|
||||
|
||||
Starting and stopping pods may introduce noise to the metric (for instance, starting may temporarily increase CPU).
|
||||
So, after each action, the autoscaler should wait some time for reliable data.
|
||||
Scale-up can only happen if there was no rescaling within the last 3 minutes.
|
||||
Scale-down will wait for 5 minutes from the last rescaling.
|
||||
Moreover, any scaling will only be made if `avg(CurrentPodsConsumption) / Target` drops below 0.9 or increases above 1.1 (10% tolerance).
|
||||
Such an approach has two benefits:
|
||||
|
||||
* Autoscaler works in a conservative way.
|
||||
If new user load appears, it is important for us to rapidly increase the number of pods,
|
||||
so that user requests will not be rejected.
|
||||
Lowering the number of pods is not that urgent.
|
||||
|
||||
* Autoscaler avoids thrashing, i.e. it prevents rapid execution of conflicting decisions if the load is not stable.
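
A compact sketch of the rule described above (illustrative only; the real controller also respects the scale-up and scale-down waiting windows, and this sketch uses the standard `math` package):

{% highlight go %}
{% raw %}
// desiredReplicas applies the formula and the 10% tolerance, clamping the
// result to [minReplicas, maxReplicas].
func desiredReplicas(podUtilization []float64, target float64, current, minReplicas, maxReplicas int) int {
	sum := 0.0
	for _, u := range podUtilization {
		sum += u
	}
	avg := sum / float64(len(podUtilization))
	if ratio := avg / target; ratio > 0.9 && ratio < 1.1 {
		return current // within tolerance: no rescale
	}
	desired := int(math.Ceil(sum / target))
	if desired < minReplicas {
		desired = minReplicas
	}
	if desired > maxReplicas {
		desired = maxReplicas
	}
	return desired
}
{% endraw %}
{% endhighlight %}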
|
||||
|
||||
## Relative vs. absolute metrics
|
||||
|
||||
We chose values of the target metric to be relative (e.g. 90% of requested CPU resource) rather than absolute (e.g. 0.6 core) for the following reason.
|
||||
If we chose an absolute metric, the user would need to guarantee that the target is lower than the request.
|
||||
Otherwise, overloaded pods may not be able to consume more than the autoscaler's absolute target utilization,
|
||||
thereby preventing the autoscaler from seeing high enough utilization to trigger it to scale up.
|
||||
This may be especially troublesome when the user changes the requested resources for a pod,
|
||||
because they would need to also change the autoscaler utilization threshold.
|
||||
Therefore, we decided to use a relative metric.
|
||||
For the user, it is enough to set it to a value smaller than 100%; further changes of requested resources will not invalidate it.
|
||||
|
||||
## Support in kubectl
|
||||
|
||||
To make manipulation of HorizontalPodAutoscaler object simpler, we added support for
|
||||
creating/updating/deleting/listing of HorizontalPodAutoscaler to kubectl.
|
||||
In addition, in the future, we are planning to add kubectl support for the following use cases:
|
||||
* When creating a replication controller or deployment with `kubectl create [-f]`, there should be
|
||||
a possibility to specify an additional autoscaler object.
|
||||
(This should work out-of-the-box when creation of autoscaler is supported by kubectl as we may include
|
||||
multiple objects in the same config file).
|
||||
* *[future]* When running an image with `kubectl run`, there should be an additional option to create
|
||||
an autoscaler for it.
|
||||
* *[future]* We will add a new command `kubectl autoscale` that will allow for easy creation of an autoscaler object
|
||||
for already existing replication controller/deployment.
|
||||
|
||||
## Next steps
|
||||
|
||||
We list here some features that are not supported in Kubernetes version 1.1.
|
||||
However, we want to keep them in mind, as they will most probably be needed in the future.
|
||||
Our design is in general compatible with them.
|
||||
* *[future]* **Autoscale pods based on metrics different than CPU** (e.g. memory, network traffic, qps).
|
||||
This includes scaling based on a custom/application metric.
|
||||
* *[future]* **Autoscale pods based on an aggregate metric.**
|
||||
Autoscaler, instead of computing average for a target metric across pods, will use a single, external, metric (e.g. qps metric from load balancer).
|
||||
The metric will be aggregated while the target will remain per-pod
|
||||
(e.g. when observing 100 qps on load balancer while the target is 20 qps per pod, autoscaler will set the number of replicas to 5).
|
||||
* *[future]* **Autoscale pods based on multiple metrics.**
|
||||
If the target numbers of pods for different metrics are different, choose the largest target number of pods.
|
||||
* *[future]* **Scale the number of pods starting from 0.**
|
||||
All pods can be turned off, and then turned on when there is demand for them.
|
||||
When a request to a service with no pods arrives, kube-proxy will generate an event for the autoscaler
|
||||
to create a new pod.
|
||||
Discussed in [#3247](https://github.com/kubernetes/kubernetes/issues/3247).
|
||||
* *[future]* **When scaling down, make a more educated decision about which pods to kill.**
|
||||
E.g.: if two or more pods from the same replication controller are on the same node, kill one of them.
|
||||
Discussed in [#4301](https://github.com/kubernetes/kubernetes/issues/4301).
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/horizontal-pod-autoscaler.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,114 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "Identifiers and Names in Kubernetes"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
# Identifiers and Names in Kubernetes
|
||||
|
||||
A summarization of the goals and recommendations for identifiers in Kubernetes. Described in [GitHub issue #199](http://issue.k8s.io/199).
|
||||
|
||||
|
||||
## Definitions
|
||||
|
||||
UID
|
||||
: A non-empty, opaque, system-generated value guaranteed to be unique in time and space; intended to distinguish between historical occurrences of similar entities.
|
||||
|
||||
Name
|
||||
: A non-empty string guaranteed to be unique within a given scope at a particular time; used in resource URLs; provided by clients at creation time and encouraged to be human friendly; intended to facilitate creation idempotence and space-uniqueness of singleton objects, distinguish distinct entities, and reference particular entities across operations.
|
||||
|
||||
[rfc1035](http://www.ietf.org/rfc/rfc1035.txt)/[rfc1123](http://www.ietf.org/rfc/rfc1123.txt) label (DNS_LABEL)
|
||||
: An alphanumeric (a-z, and 0-9) string, with a maximum length of 63 characters, with the '-' character allowed anywhere except the first or last character, suitable for use as a hostname or segment in a domain name
|
||||
|
||||
[rfc1035](http://www.ietf.org/rfc/rfc1035.txt)/[rfc1123](http://www.ietf.org/rfc/rfc1123.txt) subdomain (DNS_SUBDOMAIN)
|
||||
: One or more lowercase rfc1035/rfc1123 labels separated by '.' with a maximum length of 253 characters
|
||||
|
||||
[rfc4122](http://www.ietf.org/rfc/rfc4122.txt) universally unique identifier (UUID)
|
||||
: A 128 bit generated value that is extremely unlikely to collide across time and space and requires no central coordination
|
||||
|
||||
[rfc6335](https://tools.ietf.org/rfc/rfc6335.txt) port name (IANA_SVC_NAME)
|
||||
: An alphanumeric (a-z, and 0-9) string, with a maximum length of 15 characters, with the '-' character allowed anywhere except the first or the last character or adjacent to another '-' character; it must contain at least one letter (a-z)
|
||||
|
||||
## Objectives for names and UIDs
|
||||
|
||||
1. Uniquely identify (via a UID) an object across space and time
|
||||
|
||||
2. Uniquely name (via a name) an object across space
|
||||
|
||||
3. Provide human-friendly names in API operations and/or configuration files
|
||||
|
||||
4. Allow idempotent creation of API resources (#148) and enforcement of space-uniqueness of singleton objects
|
||||
|
||||
5. Allow DNS names to be automatically generated for some objects
|
||||
|
||||
|
||||
## General design
|
||||
|
||||
1. When an object is created via an API, a Name string (a DNS_SUBDOMAIN) must be specified. Name must be non-empty and unique within the apiserver. This enables idempotent and space-unique creation operations. Parts of the system (e.g. replication controller) may join strings (e.g. a base name and a random suffix) to create a unique Name. For situations where generating a name is impractical, some or all objects may support a param to auto-generate a name. Generating random names will defeat idempotency.
|
||||
* Examples: "guestbook.user", "backend-x4eb1"
|
||||
|
||||
2. When an object is created via an API, a Namespace string (a DNS_SUBDOMAIN? format TBD via #1114) may be specified. Depending on the API receiver, namespaces might be validated (e.g. apiserver might ensure that the namespace actually exists). If a namespace is not specified, one will be assigned by the API receiver. This assignment policy might vary across API receivers (e.g. apiserver might have a default, kubelet might generate something semi-random).
|
||||
* Example: "api.k8s.example.com"
|
||||
|
||||
3. Upon acceptance of an object via an API, the object is assigned a UID (a UUID). UID must be non-empty and unique across space and time.
|
||||
* Example: "01234567-89ab-cdef-0123-456789abcdef"
|
||||
|
||||
|
||||
## Case study: Scheduling a pod
|
||||
|
||||
Pods can be placed onto a particular node in a number of ways. This case
|
||||
study demonstrates how the above design can be applied to satisfy the
|
||||
objectives.
|
||||
|
||||
### A pod scheduled by a user through the apiserver
|
||||
|
||||
1. A user submits a pod with Namespace="" and Name="guestbook" to the apiserver.
|
||||
|
||||
2. The apiserver validates the input.
|
||||
1. A default Namespace is assigned.
|
||||
2. The pod name must be space-unique within the Namespace.
|
||||
3. Each container within the pod has a name which must be space-unique within the pod.
|
||||
|
||||
3. The pod is accepted.
|
||||
1. A new UID is assigned.
|
||||
|
||||
4. The pod is bound to a node.
|
||||
1. The kubelet on the node is passed the pod's UID, Namespace, and Name.
|
||||
|
||||
5. Kubelet validates the input.
|
||||
|
||||
6. Kubelet runs the pod.
|
||||
1. Each container is started up with enough metadata to distinguish the pod from whence it came.
|
||||
2. Each attempt to run a container is assigned a UID (a string) that is unique across time.
|
||||
* This may correspond to Docker's container ID.
|
||||
|
||||
### A pod placed by a config file on the node
|
||||
|
||||
1. A config file is stored on the node, containing a pod with UID="", Namespace="", and Name="cadvisor".
|
||||
|
||||
2. Kubelet validates the input.
|
||||
1. Since UID is not provided, kubelet generates one.
|
||||
2. Since Namespace is not provided, kubelet generates one.
|
||||
1. The generated namespace should be deterministic and cluster-unique for the source, such as a hash of the hostname and file path.
|
||||
* E.g. Namespace="file-f4231812554558a718a01ca942782d81"
|
||||
|
||||
3. Kubelet runs the pod.
|
||||
1. Each container is started up with enough metadata to distinguish the pod from whence it came.
|
||||
2. Each attempt to run a container is assigned a UID (a string) that is unique across time.
|
||||
1. This may correspond to Docker's container ID.
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/identifiers.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,39 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "Kubernetes Design Overview"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
# Kubernetes Design Overview
|
||||
|
||||
Kubernetes is a system for managing containerized applications across multiple hosts, providing basic mechanisms for deployment, maintenance, and scaling of applications.
|
||||
|
||||
Kubernetes establishes robust declarative primitives for maintaining the desired state requested by the user. We see these primitives as the main value added by Kubernetes. Self-healing mechanisms, such as auto-restarting, re-scheduling, and replicating containers require active controllers, not just imperative orchestration.
|
||||
|
||||
Kubernetes is primarily targeted at applications composed of multiple containers, such as elastic, distributed micro-services. It is also designed to facilitate migration of non-containerized application stacks to Kubernetes. It therefore includes abstractions for grouping containers in both loosely coupled and tightly coupled formations, and provides ways for containers to find and communicate with each other in relatively familiar ways.
|
||||
|
||||
Kubernetes enables users to ask a cluster to run a set of containers. The system automatically chooses hosts to run those containers on. While Kubernetes's scheduler is currently very simple, we expect it to grow in sophistication over time. Scheduling is a policy-rich, topology-aware, workload-specific function that significantly impacts availability, performance, and capacity. The scheduler needs to take into account individual and collective resource requirements, quality of service requirements, hardware/software/policy constraints, affinity and anti-affinity specifications, data locality, inter-workload interference, deadlines, and so on. Workload-specific requirements will be exposed through the API as necessary.
|
||||
|
||||
Kubernetes is intended to run on a number of cloud providers, as well as on physical hosts.
|
||||
|
||||
A single Kubernetes cluster is not intended to span multiple availability zones. Instead, we recommend building a higher-level layer to replicate complete deployments of highly available applications across multiple zones (see [the multi-cluster doc](../admin/multi-cluster.html) and [cluster federation proposal](../proposals/federation.html) for more details).
|
||||
|
||||
Finally, Kubernetes aspires to be an extensible, pluggable, building-block OSS platform and toolkit. Therefore, architecturally, we want Kubernetes to be built as a collection of pluggable components and layers, with the ability to use alternative schedulers, controllers, storage systems, and distribution mechanisms, and we're evolving its current code in that direction. Furthermore, we want others to be able to extend Kubernetes functionality, such as with higher-level PaaS functionality or multi-cluster layers, without modification of core Kubernetes source. Therefore, its API isn't just (or even necessarily mainly) targeted at end users, but at tool and extension developers. Its APIs are intended to serve as the foundation for an open ecosystem of tools, automation systems, and higher-level API layers. Consequently, there are no "internal" inter-component APIs. All APIs are visible and available, including the APIs used by the scheduler, the node controller, the replication-controller manager, Kubelet's API, etc. There's no glass to break -- in order to handle more complex use cases, one can just access the lower-level APIs in a fully transparent, composable manner.
|
||||
|
||||
For more about the Kubernetes architecture, see [architecture](architecture.html).
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/README.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,371 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "Namespaces"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
# Namespaces
|
||||
|
||||
## Abstract
|
||||
|
||||
A Namespace is a mechanism to partition resources created by users into
|
||||
a logically named group.
|
||||
|
||||
## Motivation
|
||||
|
||||
A single cluster should be able to satisfy the needs of multiple user communities.
|
||||
|
||||
Each user community wants to be able to work in isolation from other communities.
|
||||
|
||||
Each user community has its own:
|
||||
|
||||
1. resources (pods, services, replication controllers, etc.)
|
||||
2. policies (who can or cannot perform actions in their community)
|
||||
3. constraints (this community is allowed this much quota, etc.)
|
||||
|
||||
A cluster operator may create a Namespace for each unique user community.
|
||||
|
||||
The Namespace provides a unique scope for:
|
||||
|
||||
1. named resources (to avoid basic naming collisions)
|
||||
2. delegated management authority to trusted users
|
||||
3. ability to limit community resource consumption
|
||||
|
||||
## Use cases
|
||||
|
||||
1. As a cluster operator, I want to support multiple user communities on a single cluster.
|
||||
2. As a cluster operator, I want to delegate authority to partitions of the cluster to trusted users
|
||||
in those communities.
|
||||
3. As a cluster operator, I want to limit the amount of resources each community can consume in order
|
||||
to limit the impact to other communities using the cluster.
|
||||
4. As a cluster user, I want to interact with resources that are pertinent to my user community in
|
||||
isolation of what other user communities are doing on the cluster.
|
||||
|
||||
## Design
|
||||
|
||||
### Data Model
|
||||
|
||||
A *Namespace* defines a logically named group for multiple *Kind*s of resources.
|
||||
|
||||
{% highlight go %}
|
||||
{% raw %}
|
||||
type Namespace struct {
|
||||
TypeMeta `json:",inline"`
|
||||
ObjectMeta `json:"metadata,omitempty"`
|
||||
|
||||
Spec NamespaceSpec `json:"spec,omitempty"`
|
||||
Status NamespaceStatus `json:"status,omitempty"`
|
||||
}
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
A *Namespace* name is a DNS compatible label.
|
||||
|
||||
A *Namespace* must exist prior to associating content with it.
|
||||
|
||||
A *Namespace* must not be deleted if there is content associated with it.
|
||||
|
||||
To associate a resource with a *Namespace* the following conditions must be satisfied:
|
||||
|
||||
1. The resource's *Kind* must be registered as having *RESTScopeNamespace* with the server
|
||||
2. The resource's *TypeMeta.Namespace* field must have a value that references an existing *Namespace*
|
||||
|
||||
The *Name* of a resource associated with a *Namespace* is unique to that *Kind* in that *Namespace*.
|
||||
|
||||
It is intended to be used in resource URLs; provided by clients at creation time, and encouraged to be
|
||||
human friendly; intended to facilitate idempotent creation, space-uniqueness of singleton objects,
|
||||
distinguish distinct entities, and reference particular entities across operations.
|
||||
|
||||
### Authorization
|
||||
|
||||
A *Namespace* provides an authorization scope for accessing content associated with the *Namespace*.
|
||||
|
||||
See [Authorization plugins](../admin/authorization.html)
|
||||
|
||||
### Limit Resource Consumption
|
||||
|
||||
A *Namespace* provides a scope to limit resource consumption.
|
||||
|
||||
A *LimitRange* defines min/max constraints on the amount of resources a single entity can consume in
|
||||
a *Namespace*.
|
||||
|
||||
See [Admission control: Limit Range](admission_control_limit_range.html)
|
||||
|
||||
A *ResourceQuota* tracks aggregate usage of resources in the *Namespace* and allows cluster operators
|
||||
to define *Hard* resource usage limits that a *Namespace* may consume.
|
||||
|
||||
See [Admission control: Resource Quota](admission_control_resource_quota.html)
|
||||
|
||||
### Finalizers
|
||||
|
||||
Upon creation of a *Namespace*, the creator may provide a list of *Finalizer* objects.
|
||||
|
||||
{% highlight go %}
|
||||
{% raw %}
|
||||
type FinalizerName string
|
||||
|
||||
// These are internal finalizers to Kubernetes, must be qualified name unless defined here
|
||||
const (
|
||||
FinalizerKubernetes FinalizerName = "kubernetes"
|
||||
)
|
||||
|
||||
// NamespaceSpec describes the attributes on a Namespace
|
||||
type NamespaceSpec struct {
|
||||
// Finalizers is an opaque list of values that must be empty to permanently remove object from storage
|
||||
Finalizers []FinalizerName
|
||||
}
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
A *FinalizerName* is a qualified name.
|
||||
|
||||
The API Server enforces that a *Namespace* can be deleted from storage if and only if
|
||||
its *Namespace.Spec.Finalizers* list is empty.
|
||||
|
||||
A *finalize* operation is the only mechanism to modify the *Namespace.Spec.Finalizers* field post creation.
|
||||
|
||||
Each *Namespace* created has *kubernetes* as an item in its list of initial *Namespace.Spec.Finalizers*
|
||||
set by default.
|
||||
|
||||
### Phases
|
||||
|
||||
A *Namespace* may exist in the following phases.
|
||||
|
||||
{% highlight go %}
|
||||
{% raw %}
|
||||
type NamespacePhase string
|
||||
const (
|
||||
NamespaceActive NamespacePhase = "Active"
|
||||
NamespaceTerminating NamespacePhase = "Terminating"
|
||||
)
|
||||
|
||||
type NamespaceStatus struct {
|
||||
...
|
||||
Phase NamespacePhase
|
||||
}
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
A *Namespace* is in the **Active** phase if it does not have an *ObjectMeta.DeletionTimestamp*.
|
||||
|
||||
A *Namespace* is in the **Terminating** phase if it has an *ObjectMeta.DeletionTimestamp*.
|
||||
|
||||
**Active**
|
||||
|
||||
Upon creation, a *Namespace* goes into the *Active* phase. This means that content may be associated with
|
||||
a namespace, and all normal interactions with the namespace are allowed to occur in the cluster.
|
||||
|
||||
If a DELETE request occurs for a *Namespace*, the *Namespace.ObjectMeta.DeletionTimestamp* is set
|
||||
to the current server time. A *namespace controller* observes the change, and sets the *Namespace.Status.Phase*
|
||||
to *Terminating*.
|
||||
|
||||
**Terminating**
|
||||
|
||||
A *namespace controller* watches for *Namespace* objects that have a *Namespace.ObjectMeta.DeletionTimestamp*
|
||||
value set in order to know when to initiate graceful termination of the *Namespace* associated content that
|
||||
are known to the cluster.
|
||||
|
||||
The *namespace controller* enumerates each known resource type in that namespace and deletes it one by one.
|
||||
|
||||
Admission control blocks creation of new resources in that namespace in order to prevent a race-condition
|
||||
where the controller could believe all of a given resource type had been deleted from the namespace,
|
||||
when in fact some other rogue client agent had created new objects. Using admission control in this
|
||||
scenario allows each registry implementation for the individual objects to avoid having to take the Namespace life-cycle into account.
|
||||
|
||||
Once all objects known to the *namespace controller* have been deleted, the *namespace controller*
|
||||
executes a *finalize* operation on the namespace that removes the *kubernetes* value from
|
||||
the *Namespace.Spec.Finalizers* list.
|
||||
|
||||
If the *namespace controller* sees a *Namespace* whose *ObjectMeta.DeletionTimestamp* is set, and
|
||||
whose *Namespace.Spec.Finalizers* list is empty, it will signal the server to permanently remove
|
||||
the *Namespace* from storage by sending a final DELETE action to the API server.
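
In pseudo-Go, the controller's decision reduces to the following sketch; the helper functions are assumptions for illustration, not actual controller code:

{% highlight go %}
{% raw %}
// syncNamespace sketches the decision described above.
func syncNamespace(ns *api.Namespace) error {
	if ns.DeletionTimestamp == nil {
		return nil // Active: nothing to finalize
	}
	if len(ns.Spec.Finalizers) == 0 {
		// All finalizers are done: issue the final DELETE to the API server.
		return deleteNamespaceFromStorage(ns.Name)
	}
	// Delete the content this controller knows about, then invoke the
	// finalize subresource to drop the "kubernetes" token.
	return terminateNamespaceContent(ns)
}
{% endraw %}
{% endhighlight %}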
|
||||
|
||||
### REST API
|
||||
|
||||
To interact with the Namespace API:
|
||||
|
||||
| Action | HTTP Verb | Path | Description |
|
||||
| ------ | --------- | ---- | ----------- |
|
||||
| CREATE | POST | /api/{version}/namespaces | Create a namespace |
|
||||
| LIST | GET | /api/{version}/namespaces | List all namespaces |
|
||||
| UPDATE | PUT | /api/{version}/namespaces/{namespace} | Update namespace {namespace} |
|
||||
| DELETE | DELETE | /api/{version}/namespaces/{namespace} | Delete namespace {namespace} |
|
||||
| FINALIZE | POST | /api/{version}/namespaces/{namespace}/finalize | Finalize namespace {namespace} |
|
||||
| WATCH | GET | /api/{version}/watch/namespaces | Watch all namespaces |
|
||||
|
||||
This specification reserves the name *finalize* as a sub-resource to namespace.
|
||||
|
||||
As a consequence, it is invalid to have a *resourceType* managed by a namespace whose kind is *finalize*.
|
||||
|
||||
To interact with content associated with a Namespace:
|
||||
|
||||
| Action | HTTP Verb | Path | Description |
|
||||
| ---- | ---- | ---- | ---- |
|
||||
| CREATE | POST | /api/{version}/namespaces/{namespace}/{resourceType}/ | Create instance of {resourceType} in namespace {namespace} |
|
||||
| GET | GET | /api/{version}/namespaces/{namespace}/{resourceType}/{name} | Get instance of {resourceType} in namespace {namespace} with {name} |
|
||||
| UPDATE | PUT | /api/{version}/namespaces/{namespace}/{resourceType}/{name} | Update instance of {resourceType} in namespace {namespace} with {name} |
|
||||
| DELETE | DELETE | /api/{version}/namespaces/{namespace}/{resourceType}/{name} | Delete instance of {resourceType} in namespace {namespace} with {name} |
|
||||
| LIST | GET | /api/{version}/namespaces/{namespace}/{resourceType} | List instances of {resourceType} in namespace {namespace} |
|
||||
| WATCH | GET | /api/{version}/watch/namespaces/{namespace}/{resourceType} | Watch for changes to a {resourceType} in namespace {namespace} |
|
||||
| WATCH | GET | /api/{version}/watch/{resourceType} | Watch for changes to a {resourceType} across all namespaces |
|
||||
| LIST | GET | /api/{version}/list/{resourceType} | List instances of {resourceType} across all namespaces |
|
||||
|
||||
The API server verifies that the *Namespace* on resource creation matches the *{namespace}* on the path.
|
||||
|
||||
The API server will associate a resource with a *Namespace* if not populated by the end-user based on the *Namespace* context
|
||||
of the incoming request. If the *Namespace* of the resource being created or updated does not match the *Namespace* on the request,
|
||||
then the API server will reject the request.
|
||||
|
||||
### Storage
|
||||
|
||||
A namespace provides a unique identifier space and therefore must be in the storage path of a resource.
|
||||
|
||||
In etcd, we want to continue to support efficient WATCH across namespaces.
|
||||
|
||||
Resources that persist content in etcd will have storage paths as follows:
|
||||
|
||||
/{k8s_storage_prefix}/{resourceType}/{resource.Namespace}/{resource.Name}
|
||||
|
||||
This enables consumers to WATCH /registry/{resourceType} for changes across namespaces for a particular {resourceType}.
|
||||
|
||||
### Kubelet
|
||||
|
||||
The kubelet will register pods it sources from a file or http source with a namespace associated with the
|
||||
*cluster-id*.
|
||||
|
||||
### Example: OpenShift Origin managing a Kubernetes Namespace
|
||||
|
||||
In this example, we demonstrate how the design allows for agents built on-top of
|
||||
Kubernetes that manage their own set of resource types associated with a *Namespace*
|
||||
to take part in Namespace termination.
|
||||
|
||||
OpenShift creates a Namespace in Kubernetes
|
||||
|
||||
{% highlight json %}
|
||||
{% raw %}
|
||||
{
|
||||
"apiVersion":"v1",
|
||||
"kind": "Namespace",
|
||||
"metadata": {
|
||||
"name": "development",
|
||||
"labels": {
|
||||
"name": "development"
|
||||
}
|
||||
},
|
||||
"spec": {
|
||||
"finalizers": ["openshift.com/origin", "kubernetes"]
|
||||
},
|
||||
"status": {
|
||||
"phase": "Active"
|
||||
}
|
||||
}
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
OpenShift then goes and creates a set of resources (pods, services, etc) associated
|
||||
with the "development" namespace. It also creates its own set of resources in its
|
||||
own storage associated with the "development" namespace unknown to Kubernetes.
|
||||
|
||||
The user deletes the Namespace in Kubernetes, and the Namespace now has the following state:
|
||||
|
||||
{% highlight json %}
|
||||
{% raw %}
|
||||
{
|
||||
"apiVersion":"v1",
|
||||
"kind": "Namespace",
|
||||
"metadata": {
|
||||
"name": "development",
|
||||
"deletionTimestamp": "..."
|
||||
"labels": {
|
||||
"name": "development"
|
||||
}
|
||||
},
|
||||
"spec": {
|
||||
"finalizers": ["openshift.com/origin", "kubernetes"]
|
||||
},
|
||||
"status": {
|
||||
"phase": "Terminating"
|
||||
}
|
||||
}
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
The Kubernetes *namespace controller* observes the namespace has a *deletionTimestamp*
|
||||
and begins to terminate all of the content in the namespace that it knows about. Upon
|
||||
success, it executes a *finalize* action that modifies the *Namespace* by
|
||||
removing *kubernetes* from the list of finalizers:
|
||||
|
||||
{% highlight json %}
|
||||
{% raw %}
|
||||
{
|
||||
"apiVersion":"v1",
|
||||
"kind": "Namespace",
|
||||
"metadata": {
|
||||
"name": "development",
|
||||
"deletionTimestamp": "..."
|
||||
"labels": {
|
||||
"name": "development"
|
||||
}
|
||||
},
|
||||
"spec": {
|
||||
"finalizers": ["openshift.com/origin"]
|
||||
},
|
||||
"status": {
|
||||
"phase": "Terminating"
|
||||
}
|
||||
}
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
OpenShift Origin has its own *namespace controller* that is observing cluster state, and
|
||||
it observes the same namespace had a *deletionTimestamp* assigned to it. It too will go
|
||||
and purge resources from its own storage that it manages associated with that namespace.
|
||||
Upon completion, it executes a *finalize* action and removes the reference to "openshift.com/origin"
|
||||
from the list of finalizers.
|
||||
|
||||
This results in the following state:
|
||||
|
||||
{% highlight json %}
|
||||
{% raw %}
|
||||
{
|
||||
"apiVersion":"v1",
|
||||
"kind": "Namespace",
|
||||
"metadata": {
|
||||
"name": "development",
|
||||
"deletionTimestamp": "..."
|
||||
"labels": {
|
||||
"name": "development"
|
||||
}
|
||||
},
|
||||
"spec": {
|
||||
"finalizers": []
|
||||
},
|
||||
"status": {
|
||||
"phase": "Terminating"
|
||||
}
|
||||
}
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
At this point, the Kubernetes *namespace controller* in its sync loop will see that the namespace
|
||||
has a deletion timestamp and that its list of finalizers is empty. As a result, it knows all
|
||||
content associated from that namespace has been purged. It performs a final DELETE action
|
||||
to remove that Namespace from the storage.
|
||||
|
||||
At this point, all content associated with that Namespace, and the Namespace itself are gone.
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/namespaces.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,200 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "Networking"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
# Networking
|
||||
|
||||
There are 4 distinct networking problems to solve:
|
||||
|
||||
1. Highly-coupled container-to-container communications
|
||||
2. Pod-to-Pod communications
|
||||
3. Pod-to-Service communications
|
||||
4. External-to-internal communications
|
||||
|
||||
## Model and motivation
|
||||
|
||||
Kubernetes deviates from the default Docker networking model (though as of
|
||||
Docker 1.8 their network plugins are getting closer). The goal is for each pod
|
||||
to have an IP in a flat shared networking namespace that has full communication
|
||||
with other physical computers and containers across the network. IP-per-pod
|
||||
creates a clean, backward-compatible model where pods can be treated much like
|
||||
VMs or physical hosts from the perspectives of port allocation, networking,
|
||||
naming, service discovery, load balancing, application configuration, and
|
||||
migration.
|
||||
|
||||
Dynamic port allocation, on the other hand, requires supporting both static
|
||||
ports (e.g., for externally accessible services) and dynamically allocated
|
||||
ports, requires partitioning centrally allocated and locally acquired dynamic
|
||||
ports, complicates scheduling (since ports are a scarce resource), is
|
||||
inconvenient for users, complicates application configuration, is plagued by
|
||||
port conflicts and reuse and exhaustion, requires non-standard approaches to
|
||||
naming (e.g. consul or etcd rather than DNS), requires proxies and/or
|
||||
redirection for programs using standard naming/addressing mechanisms (e.g. web
|
||||
browsers), requires watching and cache invalidation for address/port changes
|
||||
for instances in addition to watching group membership changes, and obstructs
|
||||
container/pod migration (e.g. using CRIU). NAT introduces additional complexity
|
||||
by fragmenting the addressing space, which breaks self-registration mechanisms,
|
||||
among other problems.
|
||||
|
||||
## Container to container
|
||||
|
||||
All containers within a pod behave as if they are on the same host with regard
|
||||
to networking. They can all reach each other’s ports on localhost. This offers
|
||||
simplicity (static ports known a priori), security (ports bound to localhost
|
||||
are visible within the pod but never outside it), and performance. This also
|
||||
reduces friction for applications moving from the world of uncontainerized apps
|
||||
on physical or virtual hosts. People running application stacks together on
|
||||
the same host have already figured out how to make ports not conflict and have
|
||||
arranged for clients to find them.
|
||||
|
||||
The approach does reduce isolation between containers within a pod —
|
||||
ports could conflict, and there can be no container-private ports, but these
|
||||
seem to be relatively minor issues with plausible future workarounds. Besides,
|
||||
the premise of pods is that containers within a pod share some resources
|
||||
(volumes, cpu, ram, etc.) and therefore expect and tolerate reduced isolation.
|
||||
Additionally, the user can control what containers belong to the same pod
|
||||
whereas, in general, they don't control what pods land together on a host.
|
||||
|
||||
## Pod to pod
|
||||
|
||||
Because every pod gets a "real" (not machine-private) IP address, pods can
|
||||
communicate without proxies or translations. The pod can use well-known port
|
||||
numbers and can avoid the use of higher-level service discovery systems like
|
||||
DNS-SD, Consul, or Etcd.
|
||||
|
||||
When any container calls ioctl(SIOCGIFADDR) (get the address of an interface),
|
||||
it sees the same IP that any peer container would see them coming from —
|
||||
each pod has its own IP address that other pods can know. By making IP addresses
|
||||
and ports the same both inside and outside the pods, we create a NAT-less, flat
|
||||
address space. Running "ip addr show" should work as expected. This would enable
|
||||
all existing naming/discovery mechanisms to work out of the box, including
|
||||
self-registration mechanisms and applications that distribute IP addresses. We
|
||||
should be optimizing for inter-pod network communication. Within a pod,
|
||||
containers are more likely to use communication through volumes (e.g., tmpfs) or
|
||||
IPC.
|
||||
|
||||
This is different from the standard Docker model. In that model, each container
|
||||
gets an IP in the 172-dot space and would only see that 172-dot address from
|
||||
SIOCGIFADDR. If one of these containers connects to another container, the peer would see
|
||||
the connection coming from a different IP than the container itself knows. In short
|
||||
— you can never self-register anything from a container, because a
|
||||
container cannot be reached on its private IP.
|
||||
|
||||
An alternative we considered was an additional layer of addressing: pod-centric
|
||||
IP per container. Each container would have its own local IP address, visible
|
||||
only within that pod. This would perhaps make it easier for containerized
|
||||
applications to move from physical/virtual hosts to pods, but would be more
|
||||
complex to implement (e.g., requiring a bridge per pod, split-horizon/VP DNS)
|
||||
and to reason about, due to the additional layer of address translation, and
|
||||
would break self-registration and IP distribution mechanisms.
|
||||
|
||||
Like Docker, ports can still be published to the host node's interface(s), but
|
||||
the need for this is radically diminished.
|
||||
|
||||
## Implementation
|
||||
|
||||
For the Google Compute Engine cluster configuration scripts, we use [advanced
|
||||
routing rules](https://developers.google.com/compute/docs/networking#routing)
|
||||
and ip-forwarding-enabled VMs so that each VM has an extra 256 IP addresses that
|
||||
get routed to it. This is in addition to the 'main' IP address assigned to the
|
||||
VM that is NAT-ed for Internet access. The container bridge (called `cbr0` to
|
||||
differentiate it from `docker0`) is set up outside of Docker proper.
|
||||
|
||||
Example of GCE's advanced routing rules:
|
||||
|
||||
{% highlight sh %}
|
||||
{% raw %}
|
||||
gcloud compute routes add "${MINION_NAMES[$i]}" \
|
||||
--project "${PROJECT}" \
|
||||
--destination-range "${MINION_IP_RANGES[$i]}" \
|
||||
--network "${NETWORK}" \
|
||||
--next-hop-instance "${MINION_NAMES[$i]}" \
|
||||
--next-hop-instance-zone "${ZONE}" &
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
GCE itself does not know anything about these IPs, though. This means that when
|
||||
a pod tries to egress beyond GCE's project the packets must be SNAT'ed
|
||||
(masqueraded) to the VM's IP, which GCE recognizes and allows.
|
||||
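A rough sketch of what such masquerading can look like in `iptables` terms is shown below; the `10.244.0.0/16` pod range and the `eth0` interface name are placeholders, not the values used by the actual configuration scripts:

{% highlight sh %}
{% raw %}
# Illustrative only: masquerade traffic leaving the (placeholder) pod range for
# destinations outside it, so it egresses with the VM's own IP.
iptables -t nat -A POSTROUTING -s 10.244.0.0/16 ! -d 10.244.0.0/16 -o eth0 -j MASQUERADE
{% endraw %}
{% endhighlight %}
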
|
||||
### Other implementations
|
||||
|
||||
With the primary aim of providing the IP-per-pod model, other implementations exist
|
||||
to serve this purpose outside of GCE:
|
||||
- [OpenVSwitch with GRE/VxLAN](../admin/ovs-networking.html)
|
||||
- [Flannel](https://github.com/coreos/flannel#flannel)
|
||||
- [L2 networks](http://blog.oddbit.com/2014/08/11/four-ways-to-connect-a-docker/)
|
||||
("With Linux Bridge devices" section)
|
||||
- [Weave](https://github.com/zettio/weave) is yet another way to build an
|
||||
overlay network, primarily aiming at Docker integration.
|
||||
- [Calico](https://github.com/Metaswitch/calico) uses BGP to enable real
|
||||
container IPs.
|
||||
|
||||
## Pod to service
|
||||
|
||||
The [service](../user-guide/services.html) abstraction provides a way to group pods under a
|
||||
common access policy (e.g. load-balanced). The implementation of this creates a
|
||||
virtual IP which clients can access and which is transparently proxied to the
|
||||
pods in a Service. Each node runs a kube-proxy process which programs
|
||||
`iptables` rules to trap access to service IPs and redirect them to the correct
|
||||
backends. This provides a highly-available load-balancing solution with low
|
||||
performance overhead by balancing client traffic from a node on that same node.
|
||||
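As an illustration of the effect (not the actual rules that kube-proxy generates, which use dedicated chains and cover more cases), trapping a service IP and redirecting it to one backend can be expressed as a DNAT rule; the addresses and port below are placeholders:

{% highlight sh %}
{% raw %}
# Illustrative only: send traffic for a service VIP (10.0.0.10:80) to a single
# backend pod (10.244.1.5:80). kube-proxy's real rules are more involved.
iptables -t nat -A PREROUTING -p tcp -d 10.0.0.10 --dport 80 -j DNAT --to-destination 10.244.1.5:80
{% endraw %}
{% endhighlight %}
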
|
||||
## External to internal
|
||||
|
||||
So far the discussion has been about how to access a pod or service from within
|
||||
the cluster. Accessing a pod from outside the cluster is a bit more tricky. We
|
||||
want to offer highly-available, high-performance load balancing to target
|
||||
Kubernetes Services. Most public cloud providers are simply not flexible enough
|
||||
yet.
|
||||
|
||||
The way this is generally implemented is to set up external load balancers (e.g.
|
||||
GCE's ForwardingRules or AWS's ELB) which target all nodes in a cluster. When
|
||||
traffic arrives at a node it is recognized as being part of a particular Service
|
||||
and routed to an appropriate backend Pod. This does mean that some traffic will
|
||||
get double-bounced on the network. Once cloud providers have better offerings
|
||||
we can take advantage of those.
|
||||
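For example, on GCE such a setup might consist of a target pool containing the cluster's nodes plus a forwarding rule pointing at it; the names, region, and port below are placeholders, and flag spellings vary across `gcloud` releases:

{% highlight sh %}
{% raw %}
# Placeholders throughout: pool/rule names, region, and port. Cluster nodes
# would then be added to the pool (gcloud compute target-pools add-instances).
gcloud compute target-pools create my-service-pool --region us-central1
gcloud compute forwarding-rules create my-service-rule \
  --region us-central1 \
  --port-range 80 \
  --target-pool my-service-pool
{% endraw %}
{% endhighlight %}
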
|
||||
## Challenges and future work
|
||||
|
||||
### Docker API
|
||||
|
||||
Right now, `docker inspect` doesn't show the networking configuration of the
|
||||
containers, since they derive it from another container. That information should
|
||||
be exposed somehow.
|
||||
|
||||
### External IP assignment
|
||||
|
||||
We want to be able to assign IP addresses externally from Docker
|
||||
[#6743](https://github.com/dotcloud/docker/issues/6743) so that we don't need
|
||||
to statically allocate fixed-size IP ranges to each node, so that IP addresses
|
||||
can be made stable across pod infra container restarts
|
||||
([#2801](https://github.com/dotcloud/docker/issues/2801)), and to facilitate
|
||||
pod migration. Right now, if the pod infra container dies, all the user
|
||||
containers must be stopped and restarted because the netns of the pod infra
|
||||
container will change on restart, and any subsequent user container restart
|
||||
will join that new netns, thereby not being able to see its peers.
|
||||
Additionally, a change in IP address would encounter DNS caching/TTL problems.
|
||||
External IP assignment would also simplify DNS support (see below).
|
||||
|
||||
### IPv6
|
||||
|
||||
IPv6 would be a nice option, also, but we can't depend on it yet. Docker support is in progress: [Docker issue #2974](https://github.com/dotcloud/docker/issues/2974), [Docker issue #6923](https://github.com/dotcloud/docker/issues/6923), [Docker issue #6975](https://github.com/dotcloud/docker/issues/6975). Additionally, direct ipv6 assignment to instances doesn't appear to be supported by major cloud providers (e.g., AWS EC2, GCE) yet. We'd happily take pull requests from people running Kubernetes on bare metal, though. :-)
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/networking.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,240 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "Persistent Storage"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
# Persistent Storage
|
||||
|
||||
This document proposes a model for managing persistent, cluster-scoped storage for applications requiring long lived data.
|
||||
|
||||
### tl;dr
|
||||
|
||||
Two new API kinds:
|
||||
|
||||
A `PersistentVolume` (PV) is a storage resource provisioned by an administrator. It is analogous to a node. See [Persistent Volume Guide](../user-guide/persistent-volumes/) for how to use it.
|
||||
|
||||
A `PersistentVolumeClaim` (PVC) is a user's request for a persistent volume to use in a pod. It is analogous to a pod.
|
||||
|
||||
One new system component:
|
||||
|
||||
`PersistentVolumeClaimBinder` is a singleton running in the master that watches all PersistentVolumeClaims in the system and binds them to the closest matching available PersistentVolume. The volume manager watches the API for newly created volumes to manage.
|
||||
|
||||
One new volume:
|
||||
|
||||
`PersistentVolumeClaimVolumeSource` references the user's PVC in the same namespace. This volume finds the bound PV and mounts that volume for the pod. A `PersistentVolumeClaimVolumeSource` is, essentially, a wrapper around another type of volume that is owned by someone else (the system).
|
||||
|
||||
Kubernetes makes no guarantees at runtime that the underlying storage exists or is available. High availability is left to the storage provider.
|
||||
|
||||
### Goals
|
||||
|
||||
* Allow administrators to describe available storage
|
||||
* Allow pod authors to discover and request persistent volumes to use with pods
|
||||
* Enforce security through access control lists and securing storage to the same namespace as the pod volume
|
||||
* Enforce quotas through admission control
|
||||
* Enforce scheduler rules by resource counting
|
||||
* Ensure developers can rely on storage being available without being closely bound to a particular disk, server, network, or storage device.
|
||||
|
||||
|
||||
#### Describe available storage
|
||||
|
||||
Cluster administrators use the API to manage *PersistentVolumes*. A custom store `NewPersistentVolumeOrderedIndex` will index volumes by access modes and sort by storage capacity. The `PersistentVolumeClaimBinder` watches for new claims for storage and binds them to an available volume by matching the volume's characteristics (AccessModes and storage size) to the user's request.
|
||||
|
||||
PVs are system objects and, thus, have no namespace.
|
||||
|
||||
Many means of dynamic provisioning will eventually be implemented for various storage types.
|
||||
|
||||
|
||||
##### PersistentVolume API
|
||||
|
||||
| Action | HTTP Verb | Path | Description |
|
||||
| ---- | ---- | ---- | ---- |
|
||||
| CREATE | POST | /api/{version}/persistentvolumes/ | Create instance of PersistentVolume |
|
||||
| GET | GET | /api/{version}/persistentvolumes/{name} | Get instance of PersistentVolume with {name} |
|
||||
| UPDATE | PUT | /api/{version}/persistentvolumes/{name} | Update instance of PersistentVolume with {name} |
|
||||
| DELETE | DELETE | /api/{version}/persistentvolumes/{name} | Delete instance of PersistentVolume with {name} |
|
||||
| LIST | GET | /api/{version}/persistentvolumes | List instances of PersistentVolume |
|
||||
| WATCH | GET | /api/{version}/watch/persistentvolumes | Watch for changes to a PersistentVolume |
|
||||
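As one way to exercise these endpoints by hand (assuming `kubectl proxy` is serving the API locally on its default port, 8001, and `{version}` is `v1`), the LIST operation could be invoked directly:

{% highlight sh %}
{% raw %}
# Assumes 'kubectl proxy' is serving the API on localhost:8001 (its default).
kubectl proxy &
# List all PersistentVolumes via the LIST path from the table above.
curl http://localhost:8001/api/v1/persistentvolumes
{% endraw %}
{% endhighlight %}
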
|
||||
|
||||
#### Request Storage
|
||||
|
||||
Kubernetes users request persistent storage for their pod by creating a ```PersistentVolumeClaim```. Their request for storage is described by their requirements for resources and mount capabilities.
|
||||
|
||||
Requests for volumes are bound to available volumes by the volume manager, if a suitable match is found. Requests for resources can go unfulfilled.
|
||||
|
||||
Users attach their claim to their pod using a new ```PersistentVolumeClaimVolumeSource``` volume source.
|
||||
|
||||
|
||||
##### PersistentVolumeClaim API
|
||||
|
||||
|
||||
| Action | HTTP Verb | Path | Description |
|
||||
| ---- | ---- | ---- | ---- |
|
||||
| CREATE | POST | /api/{version}/namespaces/{ns}/persistentvolumeclaims/ | Create instance of PersistentVolumeClaim in namespace {ns} |
|
||||
| GET | GET | /api/{version}/namespaces/{ns}/persistentvolumeclaims/{name} | Get instance of PersistentVolumeClaim in namespace {ns} with {name} |
|
||||
| UPDATE | PUT | /api/{version}/namespaces/{ns}/persistentvolumeclaims/{name} | Update instance of PersistentVolumeClaim in namespace {ns} with {name} |
|
||||
| DELETE | DELETE | /api/{version}/namespaces/{ns}/persistentvolumeclaims/{name} | Delete instance of PersistentVolumeClaim in namespace {ns} with {name} |
|
||||
| LIST | GET | /api/{version}/namespaces/{ns}/persistentvolumeclaims | List instances of PersistentVolumeClaim in namespace {ns} |
|
||||
| WATCH | GET | /api/{version}/watch/namespaces/{ns}/persistentvolumeclaims | Watch for changes to PersistentVolumeClaim in namespace {ns} |
|
||||
|
||||
|
||||
|
||||
#### Scheduling constraints
|
||||
|
||||
Scheduling constraints are to be handled similarly to pod resource constraints. Pods will need to be annotated or decorated with the number of resources they require on a node. Similarly, a node will need to list how many it has used and how many are available.
|
||||
|
||||
TBD
|
||||
|
||||
|
||||
#### Events
|
||||
|
||||
The implementation of persistent storage will not require events to communicate the state of a claim to the user. The CLI output for a bound claim contains a reference to the backing persistent volume. This is always present in the API and CLI, making a separate event that communicates the same information unnecessary.
|
||||
|
||||
Events that communicate the state of a mounted volume are left to the volume plugins.
|
||||
|
||||
|
||||
### Example
|
||||
|
||||
#### Admin provisions storage
|
||||
|
||||
An administrator provisions storage by posting PVs to the API. Various ways to automate this task can be scripted. Dynamic provisioning is a future feature that can maintain levels of PVs.
|
||||
|
||||
{% highlight yaml %}
|
||||
{% raw %}
|
||||
POST:
|
||||
|
||||
kind: PersistentVolume
|
||||
apiVersion: v1
|
||||
metadata:
|
||||
name: pv0001
|
||||
spec:
|
||||
capacity:
|
||||
storage: 10
|
||||
persistentDisk:
|
||||
pdName: "abc123"
|
||||
fsType: "ext4"
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl get pv
|
||||
|
||||
NAME LABELS CAPACITY ACCESSMODES STATUS CLAIM REASON
|
||||
pv0001 map[] 10737418240 RWO Pending
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
#### Users request storage
|
||||
|
||||
A user requests storage by posting a PVC to the API. Their request contains the AccessModes they wish their volume to have and the minimum size needed.
|
||||
|
||||
The user must be within a namespace to create PVCs.
|
||||
|
||||
{% highlight yaml %}
|
||||
{% raw %}
|
||||
POST:
|
||||
|
||||
kind: PersistentVolumeClaim
|
||||
apiVersion: v1
|
||||
metadata:
|
||||
name: myclaim-1
|
||||
spec:
|
||||
accessModes:
|
||||
- ReadWriteOnce
|
||||
resources:
|
||||
requests:
|
||||
storage: 3
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl get pvc
|
||||
|
||||
NAME LABELS STATUS VOLUME
|
||||
myclaim-1 map[] pending
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
|
||||
#### Matching and binding
|
||||
|
||||
The ```PersistentVolumeClaimBinder``` attempts to find an available volume that most closely matches the user's request. If one exists, they are bound by putting a reference on the PV to the PVC. Requests can go unfulfilled if a suitable match is not found.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl get pv
|
||||
|
||||
NAME LABELS CAPACITY ACCESSMODES STATUS CLAIM REASON
|
||||
pv0001 map[] 10737418240 RWO Bound myclaim-1 / f4b3d283-c0ef-11e4-8be4-80e6500a981e
|
||||
|
||||
|
||||
$ kubectl get pvc
|
||||
|
||||
NAME LABELS STATUS VOLUME
|
||||
myclaim-1 map[] Bound b16e91d6-c0ef-11e4-8be4-80e6500a981e
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
#### Claim usage
|
||||
|
||||
The claim holder can use their claim as a volume. The ```PersistentVolumeClaimVolumeSource``` knows to fetch the PV backing the claim and mount its volume for a pod.
|
||||
|
||||
The claim holder owns the claim and its data for as long as the claim exists. The pod using the claim can be deleted, but the claim remains in the user's namespace. It can be used again and again by many pods.
|
||||
|
||||
{% highlight yaml %}
|
||||
{% raw %}
|
||||
POST:
|
||||
|
||||
kind: Pod
|
||||
apiVersion: v1
|
||||
metadata:
|
||||
name: mypod
|
||||
spec:
|
||||
containers:
|
||||
- image: nginx
|
||||
name: myfrontend
|
||||
volumeMounts:
|
||||
- mountPath: "/var/www/html"
|
||||
name: mypd
|
||||
volumes:
|
||||
- name: mypd
|
||||
source:
|
||||
persistentVolumeClaim:
|
||||
accessMode: ReadWriteOnce
|
||||
claimRef:
|
||||
name: myclaim-1
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
#### Releasing a claim and Recycling a volume
|
||||
|
||||
When a claim holder is finished with their data, they can delete their claim.
|
||||
|
||||
{% highlight console %}
|
||||
{% raw %}
|
||||
$ kubectl delete pvc myclaim-1
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
The ```PersistentVolumeClaimBinder``` will reconcile this by removing the claim reference from the PV and changing the PV's status to 'Released'.
|
||||
|
||||
Admins can script the recycling of released volumes. Future dynamic provisioners will understand how a volume should be recycled.
|
||||
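A minimal sketch of such a script, assuming the `kubectl get pv` column layout shown above (STATUS in the fifth column) and a hypothetical directory of saved volume definitions:

{% highlight sh %}
{% raw %}
# Illustrative only: delete Released volumes and re-post them from saved
# definitions. Column positions assume the 'kubectl get pv' output shown above;
# the pv-definitions/ directory is hypothetical.
for pv in $(kubectl get pv | awk 'NR > 1 && $5 == "Released" { print $1 }'); do
  kubectl delete pv "${pv}"
  kubectl create -f "pv-definitions/${pv}.yaml"
done
{% endraw %}
{% endhighlight %}
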
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/persistent-storage.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,77 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "Design Principles"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
# Design Principles
|
||||
|
||||
Principles to follow when extending Kubernetes.
|
||||
|
||||
## API
|
||||
|
||||
See also the [API conventions](../devel/api-conventions.html).
|
||||
|
||||
* All APIs should be declarative.
|
||||
* API objects should be complementary and composable, not opaque wrappers.
|
||||
* The control plane should be transparent -- there are no hidden internal APIs.
|
||||
* The cost of API operations should be proportional to the number of objects intentionally operated upon. Therefore, common filtered lookups must be indexed. Beware of patterns of multiple API calls that would incur quadratic behavior.
|
||||
* Object status must be 100% reconstructable by observation. Any history kept must be just an optimization and not required for correct operation.
|
||||
* Cluster-wide invariants are difficult to enforce correctly. Try not to add them. If you must have them, don't enforce them atomically in master components; that is contention-prone and doesn't provide a recovery path in the case of a bug allowing the invariant to be violated. Instead, provide a series of checks to reduce the probability of a violation, and make every component involved able to recover from an invariant violation.
|
||||
* Low-level APIs should be designed for control by higher-level systems. Higher-level APIs should be intent-oriented (think SLOs) rather than implementation-oriented (think control knobs).
|
||||
|
||||
## Control logic
|
||||
|
||||
* Functionality must be *level-based*, meaning the system must operate correctly given the desired state and the current/observed state, regardless of how many intermediate state updates may have been missed. Edge-triggered behavior must be just an optimization.
|
||||
* Assume an open world: continually verify assumptions and gracefully adapt to external events and/or actors. Example: we allow users to kill pods under control of a replication controller; it just replaces them.
|
||||
* Do not define comprehensive state machines for objects with behaviors associated with state transitions and/or "assumed" states that cannot be ascertained by observation.
|
||||
* Don't assume a component's decisions will not be overridden or rejected, nor that the component will always understand why. For example, etcd may reject writes. Kubelet may reject pods. The scheduler may not be able to schedule pods. Retry, but back off and/or make alternative decisions.
|
||||
* Components should be self-healing. For example, if you must keep some state (e.g., a cache), the content needs to be periodically refreshed, so that if an item does get erroneously stored or a deletion event is missed, etc., it will soon be fixed, ideally on timescales that are shorter than what will attract attention from humans.
|
||||
* Component behavior should degrade gracefully. Prioritize actions so that the most important activities can continue to function even when overloaded and/or in states of partial failure.
|
||||
|
||||
## Architecture
|
||||
|
||||
* Only the apiserver should communicate with etcd/store, and not other components (scheduler, kubelet, etc.).
|
||||
* Compromising a single node shouldn't compromise the cluster.
|
||||
* Components should continue to do what they were last told in the absence of new instructions (e.g., due to network partition or component outage).
|
||||
* All components should keep all relevant state in memory all the time. The apiserver should write through to etcd/store, other components should write through to the apiserver, and they should watch for updates made by other clients.
|
||||
* Watch is preferred over polling.
|
||||
|
||||
## Extensibility
|
||||
|
||||
TODO: pluggability
|
||||
|
||||
## Bootstrapping
|
||||
|
||||
* [Self-hosting](http://issue.k8s.io/246) of all components is a goal.
|
||||
* Minimize the number of dependencies, particularly those required for steady-state operation.
|
||||
* Stratify the dependencies that remain via principled layering.
|
||||
* Break any circular dependencies by converting hard dependencies to soft dependencies.
|
||||
* Also accept data from other components via another source, such as local files, which can then be manually populated at bootstrap time and continuously updated once those other components are available.
|
||||
* State should be rediscoverable and/or reconstructable.
|
||||
* Make it easy to run temporary, bootstrap instances of all components in order to create the runtime state needed to run the components in the steady state; use a lock (master election for distributed components, file lock for local components like Kubelet) to coordinate handoff. We call this technique "pivoting".
|
||||
* Have a solution to restart dead components. For distributed components, replication works well. For local components such as Kubelet, a process manager or even a simple shell loop works.
|
||||
|
||||
## Availability
|
||||
|
||||
TODO
|
||||
|
||||
## General principles
|
||||
|
||||
* [Eric Raymond's 17 UNIX rules](https://en.wikipedia.org/wiki/Unix_philosophy#Eric_Raymond.E2.80.99s_17_Unix_Rules)
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/principles.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,261 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "The Kubernetes resource model"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
**Note: this is a design doc, which describes features that have not been completely implemented.
|
||||
User documentation of the current state is [here](../user-guide/compute-resources.html). The tracking issue for
|
||||
implementation of this model is
|
||||
[#168](http://issue.k8s.io/168). Currently, both limits and requests of memory and
|
||||
cpu on containers (not pods) are supported. "memory" is in bytes and "cpu" is in
|
||||
milli-cores.**
|
||||
|
||||
# The Kubernetes resource model
|
||||
|
||||
To do good pod placement, Kubernetes needs to know how big pods are, as well as the sizes of the nodes onto which they are being placed. The definition of "how big" is given by the Kubernetes resource model — the subject of this document.
|
||||
|
||||
The resource model aims to be:
|
||||
* simple, for common cases;
|
||||
* extensible, to accommodate future growth;
|
||||
* regular, with few special cases; and
|
||||
* precise, to avoid misunderstandings and promote pod portability.
|
||||
|
||||
## The resource model
|
||||
|
||||
A Kubernetes _resource_ is something that can be requested by, allocated to, or consumed by a pod or container. Examples include memory (RAM), CPU, disk-time, and network bandwidth.
|
||||
|
||||
Once resources on a node have been allocated to one pod, they should not be allocated to another until that pod is removed or exits. This means that Kubernetes schedulers should ensure that the sum of the resources allocated (requested and granted) to a node's pods never exceeds the usable capacity of the node. Testing whether a pod will fit on a node is called _feasibility checking_.
|
||||
|
||||
Note that the resource model currently prohibits over-committing resources; we will want to relax that restriction later.
|
||||
|
||||
### Resource types
|
||||
|
||||
All resources have a _type_ that is identified by their _typename_ (a string, e.g., "memory"). Several resource types are predefined by Kubernetes (a full list is below), although only two will be supported at first: CPU and memory. Users and system administrators can define their own resource types if they wish (e.g., Hadoop slots).
|
||||
|
||||
A fully-qualified resource typename is constructed from a DNS-style _subdomain_, followed by a slash `/`, followed by a name.
|
||||
* The subdomain must conform to [RFC 1123](http://www.ietf.org/rfc/rfc1123.txt) (e.g., `kubernetes.io`, `example.com`).
|
||||
* The name must be no more than 63 characters, consisting of upper- or lower-case alphanumeric characters, with the `-`, `_`, and `.` characters allowed anywhere except the first or last character.
|
||||
* As a shorthand, any resource typename that does not start with a subdomain and a slash will automatically be prefixed with the built-in Kubernetes _namespace_, `kubernetes.io/` in order to fully-qualify it. This namespace is reserved for code in the open source Kubernetes repository; as a result, all user typenames MUST be fully qualified, and cannot be created in this namespace.
|
||||
|
||||
Some example typenames include `memory` (which will be fully-qualified as `kubernetes.io/memory`), and `example.com/Shiny_New-Resource.Type`.
|
||||
|
||||
For future reference, note that some resources, such as CPU and network bandwidth, are _compressible_, which means that their usage can potentially be throttled in a relatively benign manner. All other resources are _incompressible_, which means that any attempt to throttle them is likely to cause grief. This distinction will be important if a Kubernetes implementation supports over-committing of resources.
|
||||
|
||||
### Resource quantities
|
||||
|
||||
Initially, all Kubernetes resource types are _quantitative_, and have an associated _unit_ for quantities of the associated resource (e.g., bytes for memory, bytes per second for bandwidth, instances for software licences). The units will always be a resource type's natural base units (e.g., bytes, not MB), to avoid confusion between binary and decimal multipliers and the underlying unit multiplier (e.g., is memory measured in MiB, MB, or GB?).
|
||||
|
||||
Resource quantities can be added and subtracted: for example, a node has a fixed quantity of each resource type that can be allocated to pods/containers; once such an allocation has been made, the allocated resources cannot be made available to other pods/containers without over-committing the resources.
|
||||
|
||||
To make life easier for people, quantities can be represented externally as unadorned integers, or as fixed-point integers with one of these SI suffixes (E, P, T, G, M, K, m) or their power-of-two equivalents (Ei, Pi, Ti, Gi, Mi, Ki). For example, the following represent roughly the same value: 128974848, "129e6", "129M", "123Mi". Small quantities can be represented directly as decimals (e.g., 0.3), or using milli-units (e.g., "300m").
|
||||
* "Externally" means in user interfaces, reports, graphs, and in JSON or YAML resource specifications that might be generated or read by people.
|
||||
* Case is significant: "m" and "M" are not the same, so "k" is not a valid SI suffix. There are no power-of-two equivalents for SI suffixes that represent multipliers less than 1.
|
||||
* These conventions only apply to resource quantities, not arbitrary values.
|
||||
|
||||
Internally (i.e., everywhere else), Kubernetes will represent resource quantities as integers so it can avoid problems with rounding errors, and will not use strings to represent numeric values. To achieve this, quantities that naturally have fractional parts (e.g., CPU seconds/second) will be scaled to integral numbers of milli-units (e.g., milli-CPUs) as soon as they are read in. Internal APIs, data structures, and protobufs will use these scaled integer units. Raw measurement data such as usage may still need to be tracked and calculated using floating point values, but internally they should be rescaled to avoid some values being in milli-units and some not.
|
||||
* Note that reading in a resource quantity and writing it out again may change the way its values are represented, and truncate precision (e.g., 1.0001 may become 1.000), so comparison and difference operations (e.g., by an updater) must be done on the internal representations.
|
||||
* Avoiding milli-units in external representations has advantages for people who will use Kubernetes, but runs the risk of developers forgetting to rescale or accidentally using floating-point representations. That seems like the right choice. We will try to reduce the risk by providing libraries that automatically do the quantization for JSON/YAML inputs.
|
||||
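To make the suffix arithmetic concrete, GNU coreutils' `numfmt` (not a Kubernetes tool, just an illustration) expands the external representations above into base units:

{% highlight sh %}
{% raw %}
# SI suffix: "129M" = 129 * 10^6
numfmt --from=si 129M       # 129000000
# Power-of-two suffix: "123Mi" = 123 * 2^20
numfmt --from=iec-i 123Mi   # 128974848
{% endraw %}
{% endhighlight %}
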
|
||||
### Resource specifications
|
||||
|
||||
Both users and a number of system components, such as schedulers, (horizontal) auto-scalers, (vertical) auto-sizers, load balancers, and worker-pool managers need to reason about resource requirements of workloads, resource capacities of nodes, and resource usage. Kubernetes separates specifications of *desired state*, aka the Spec, from representations of *current state*, aka the Status. Resource requirements and total node capacity fall into the specification category, while resource usage, characterizations derived from usage (e.g., maximum usage, histograms), and other resource demand signals (e.g., CPU load) clearly fall into the status category and are discussed in the Appendix for now.
|
||||
|
||||
Resource requirements for a container or pod should have the following form:
|
||||
|
||||
{% highlight yaml %}
|
||||
{% raw %}
|
||||
resourceRequirementSpec: [
|
||||
request: [ cpu: 2.5, memory: "40Mi" ],
|
||||
limit: [ cpu: 4.0, memory: "99Mi" ],
|
||||
]
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
Where:
|
||||
* _request_ [optional]: the amount of resources being requested, or that were requested and have been allocated. Scheduler algorithms will use these quantities to test feasibility (whether a pod will fit onto a node). If a container (or pod) tries to use more resources than its _request_, any associated SLOs are voided — e.g., the program it is running may be throttled (compressible resource types), or the attempt may be denied. If _request_ is omitted for a container, it defaults to _limit_ if that is explicitly specified, otherwise to an implementation-defined value; this will always be 0 for a user-defined resource type. If _request_ is omitted for a pod, it defaults to the sum of the (explicit or implicit) _request_ values for the containers it encloses.
|
||||
|
||||
* _limit_ [optional]: an upper bound or cap on the maximum amount of resources that will be made available to a container or pod; if a container or pod uses more resources than its _limit_, it may be terminated. The _limit_ defaults to "unbounded"; in practice, this probably means the capacity of an enclosing container, pod, or node, but may result in non-deterministic behavior, especially for memory.
|
||||
|
||||
Total capacity for a node should have a similar structure:
|
||||
|
||||
{% highlight yaml %}
|
||||
{% raw %}
|
||||
resourceCapacitySpec: [
|
||||
total: [ cpu: 12, memory: "128Gi" ]
|
||||
]
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
Where:
|
||||
* _total_: the total allocatable resources of a node. Initially, the resources at a given scope will bound the sum of the resources of its inner scopes.
|
||||
|
||||
#### Notes
|
||||
|
||||
* It is an error to specify the same resource type more than once in each list.
|
||||
|
||||
* It is an error for the _request_ or _limit_ values for a pod to be less than the sum of the (explicit or defaulted) values for the containers it encloses. (We may relax this later.)
|
||||
|
||||
* If multiple pods are running on the same node and attempting to use more resources than they have requested, the result is implementation-defined. For example: unallocated or unused resources might be spread equally across claimants, or the assignment might be weighted by the size of the original request, or as a function of limits, or priority, or the phase of the moon, perhaps modulated by the direction of the tide. Thus, although it's not mandatory to provide a _request_, it's probably a good idea. (Note that the _request_ could be filled in by an automated system that is observing actual usage and/or historical data.)
|
||||
|
||||
* Internally, the Kubernetes master can decide the defaulting behavior and the kubelet implementation may expect an absolute specification. For example, if the master decided that "the default is unbounded" it would pass 2^64 to the kubelet.
|
||||
|
||||
|
||||
## Kubernetes-defined resource types
|
||||
|
||||
The following resource types are predefined ("reserved") by Kubernetes in the `kubernetes.io` namespace, and so cannot be used for user-defined resources. Note that the syntax of all resource types in the resource spec is deliberately similar, but some resource types (e.g., CPU) may receive significantly more support than simply tracking quantities in the schedulers and/or the Kubelet.
|
||||
|
||||
### Processor cycles
|
||||
|
||||
* Name: `cpu` (or `kubernetes.io/cpu`)
|
||||
* Units: Kubernetes Compute Unit seconds/second (i.e., CPU cores normalized to a canonical "Kubernetes CPU")
|
||||
* Internal representation: milli-KCUs
|
||||
* Compressible? yes
|
||||
* Qualities: this is a placeholder for the kind of thing that may be supported in the future — see [#147](http://issue.k8s.io/147)
|
||||
* [future] `schedulingLatency`: as per lmctfy
|
||||
* [future] `cpuConversionFactor`: property of a node: the speed of a CPU core on the node's processor divided by the speed of the canonical Kubernetes CPU (a floating point value; default = 1.0).
|
||||
|
||||
To reduce performance portability problems for pods, and to avoid worst-case provisioning behavior, the units of CPU will be normalized to a canonical "Kubernetes Compute Unit" (KCU, pronounced ˈko͝oko͞o), which will roughly be equivalent to a single CPU hyperthreaded core for some recent x86 processor. The normalization may be implementation-defined, although some reasonable defaults will be provided in the open-source Kubernetes code.
|
||||
|
||||
Note that requesting 2 KCU won't guarantee that precisely 2 physical cores will be allocated — control of aspects like this will be handled by resource _qualities_ (a future feature).
|
||||
|
||||
|
||||
### Memory
|
||||
|
||||
* Name: `memory` (or `kubernetes.io/memory`)
|
||||
* Units: bytes
|
||||
* Compressible? no (at least initially)
|
||||
|
||||
Precisely what "memory" means is implementation dependent, but the basic idea is to rely on the underlying `memcg` mechanisms, support, and definitions.
|
||||
|
||||
Note that most people will want to use power-of-two suffixes (Mi, Gi) for memory quantities
|
||||
rather than decimal ones: "64MiB" rather than "64MB".
|
||||
|
||||
|
||||
## Resource metadata
|
||||
|
||||
A resource type may have an associated read-only ResourceType structure that contains metadata about the type. For example:
|
||||
|
||||
{% highlight yaml %}
|
||||
{% raw %}
|
||||
resourceTypes: [
|
||||
"kubernetes.io/memory": [
|
||||
isCompressible: false, ...
|
||||
]
|
||||
"kubernetes.io/cpu": [
|
||||
isCompressible: true,
|
||||
internalScaleExponent: 3, ...
|
||||
]
|
||||
"kubernetes.io/disk-space": [ ... ]
|
||||
]
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
Kubernetes will provide ResourceType metadata for its predefined types. If no resource metadata can be found for a resource type, Kubernetes will assume that it is a quantified, incompressible resource that is not specified in milli-units, and has no default value.
|
||||
|
||||
The defined properties are as follows:
|
||||
|
||||
| field name | type | contents |
|
||||
| ---------- | ---- | -------- |
|
||||
| name | string, required | the typename, as a fully-qualified string (e.g., `kubernetes.io/cpu`) |
|
||||
| internalScaleExponent | int, default=0 | external values are multiplied by 10 to this power for internal storage (e.g., 3 for milli-units) |
|
||||
| units | string, required | format: `unit* [per unit+]` (e.g., `second`, `byte per second`). An empty unit field means "dimensionless". |
|
||||
| isCompressible | bool, default=false | true if the resource type is compressible |
|
||||
| defaultRequest | string, default=none | in the same format as a user-supplied value |
|
||||
| _[future]_ quantization | number, default=1 | smallest granularity of allocation: requests may be rounded up to a multiple of this unit; implementation-defined unit (e.g., the page size for RAM). |
|
||||
|
||||
|
||||
# Appendix: future extensions
|
||||
|
||||
The following are planned future extensions to the resource model, included here to encourage comments.
|
||||
|
||||
## Usage data
|
||||
|
||||
Because resource usage and related metrics change continuously, need to be tracked over time (i.e., historically), can be characterized in a variety of ways, and are fairly voluminous, we will not include usage in core API objects, such as [Pods](../user-guide/pods.html) and Nodes, but will provide separate APIs for accessing and managing that data. See the Appendix for possible representations of usage data, but the representation we'll use is TBD.
|
||||
|
||||
Singleton values for observed and predicted future usage will rapidly prove inadequate, so we will support the following structure for extended usage information:
|
||||
|
||||
{% highlight yaml %}
|
||||
{% raw %}
|
||||
resourceStatus: [
|
||||
usage: [ cpu: <CPU-info>, memory: <memory-info> ],
|
||||
maxusage: [ cpu: <CPU-info>, memory: <memory-info> ],
|
||||
predicted: [ cpu: <CPU-info>, memory: <memory-info> ],
|
||||
]
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
where a `<CPU-info>` or `<memory-info>` structure looks like this:
|
||||
|
||||
{% highlight yaml %}
|
||||
{% raw %}
|
||||
{
|
||||
mean: <value> # arithmetic mean
|
||||
max: <value> # maximum value
|
||||
min: <value> # minimum value
|
||||
count: <value> # number of data points
|
||||
percentiles: [ # map from %iles to values
|
||||
"10": <10th-percentile-value>,
|
||||
"50": <median-value>,
|
||||
"99": <99th-percentile-value>,
|
||||
"99.9": <99.9th-percentile-value>,
|
||||
...
|
||||
]
|
||||
}
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
All parts of this structure are optional, although we strongly encourage including quantities for 50, 90, 95, 99, 99.5, and 99.9 percentiles. _[In practice, it will be important to include additional info such as the length of the time window over which the averages are calculated, the confidence level, and information-quality metrics such as the number of dropped or discarded data points.]_
|
||||
and predicted
|
||||
|
||||
## Future resource types
|
||||
|
||||
### _[future] Network bandwidth_
|
||||
|
||||
* Name: "network-bandwidth" (or `kubernetes.io/network-bandwidth`)
|
||||
* Units: bytes per second
|
||||
* Compressible? yes
|
||||
|
||||
### _[future] Network operations_
|
||||
|
||||
* Name: "network-iops" (or `kubernetes.io/network-iops`)
|
||||
* Units: operations (messages) per second
|
||||
* Compressible? yes
|
||||
|
||||
### _[future] Storage space_
|
||||
|
||||
* Name: "storage-space" (or `kubernetes.io/storage-space`)
|
||||
* Units: bytes
|
||||
* Compressible? no
|
||||
|
||||
The amount of secondary storage space available to a container. The main target is local disk drives and SSDs, although this could also be used to qualify remotely-mounted volumes. Specifying whether a resource is a raw disk, an SSD, a disk array, or a file system fronting any of these, is left for future work.
|
||||
|
||||
### _[future] Storage time_
|
||||
|
||||
* Name: storage-time (or `kubernetes.io/storage-time`)
|
||||
* Units: seconds per second of disk time
|
||||
* Internal representation: milli-units
|
||||
* Compressible? yes
|
||||
|
||||
This is the amount of time a container spends accessing disk, including actuator and transfer time. A standard disk drive provides 1.0 diskTime seconds per second.
|
||||
|
||||
### _[future] Storage operations_
|
||||
|
||||
* Name: "storage-iops" (or `kubernetes.io/storage-iops`)
|
||||
* Units: operations per second
|
||||
* Compressible? yes
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/resources.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,611 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "Abstract"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
## Abstract
|
||||
|
||||
A proposal for the distribution of [secrets](../user-guide/secrets.html) (passwords, keys, etc) to the Kubelet and to
|
||||
containers inside Kubernetes using a custom [volume](../user-guide/volumes.html#secrets) type. See the [secrets example](../user-guide/secrets/) for more information.
|
||||
|
||||
## Motivation
|
||||
|
||||
Secrets are needed in containers to access internal resources like the Kubernetes master or
|
||||
external resources such as git repositories, databases, etc. Users may also want behaviors in the
|
||||
kubelet that depend on secret data (credentials for image pull from a docker registry) associated
|
||||
with pods.
|
||||
|
||||
Goals of this design:
|
||||
|
||||
1. Describe a secret resource
|
||||
2. Define the various challenges attendant to managing secrets on the node
|
||||
3. Define a mechanism for consuming secrets in containers without modification
|
||||
|
||||
## Constraints and Assumptions
|
||||
|
||||
* This design does not prescribe a method for storing secrets; storage of secrets should be
|
||||
pluggable to accommodate different use-cases
|
||||
* Encryption of secret data and node security are orthogonal concerns
|
||||
* It is assumed that node and master are secure and that compromising their security could also
|
||||
compromise secrets:
|
||||
* If a node is compromised, the only secrets that could potentially be exposed should be the
|
||||
secrets belonging to containers scheduled onto it
|
||||
* If the master is compromised, all secrets in the cluster may be exposed
|
||||
* Secret rotation is an orthogonal concern, but it should be facilitated by this proposal
|
||||
* A user who can consume a secret in a container can know the value of the secret; secrets must
|
||||
be provisioned judiciously
|
||||
|
||||
## Use Cases
|
||||
|
||||
1. As a user, I want to store secret artifacts for my applications and consume them securely in
|
||||
containers, so that I can keep the configuration for my applications separate from the images
|
||||
that use them:
|
||||
1. As a cluster operator, I want to allow a pod to access the Kubernetes master using a custom
|
||||
`.kubeconfig` file, so that I can securely reach the master
|
||||
2. As a cluster operator, I want to allow a pod to access a Docker registry using credentials
|
||||
from a `.dockercfg` file, so that containers can push images
|
||||
3. As a cluster operator, I want to allow a pod to access a git repository using SSH keys,
|
||||
so that I can push to and fetch from the repository
|
||||
2. As a user, I want to allow containers to consume supplemental information about services such
|
||||
as username and password which should be kept secret, so that I can share secrets about a
|
||||
service amongst the containers in my application securely
|
||||
3. As a user, I want to associate a pod with a `ServiceAccount` that consumes a secret and have
|
||||
the kubelet implement some reserved behaviors based on the types of secrets the service account
|
||||
consumes:
|
||||
1. Use credentials for a docker registry to pull the pod's docker image
|
||||
2. Present Kubernetes auth token to the pod or transparently decorate traffic between the pod
|
||||
and master service
|
||||
4. As a user, I want to be able to indicate that a secret expires and for that secret's value to
|
||||
be rotated once it expires, so that the system can help me follow good practices
|
||||
|
||||
### Use-Case: Configuration artifacts
|
||||
|
||||
Many configuration files contain secrets intermixed with other configuration information. For
|
||||
example, a user's application may contain a properties file that contains database credentials,
|
||||
SaaS API tokens, etc. Users should be able to consume configuration artifacts in their containers
|
||||
and be able to control the path on the container's filesystems where the artifact will be
|
||||
presented.
|
||||
|
||||
### Use-Case: Metadata about services
|
||||
|
||||
Most pieces of information about how to use a service are secrets. For example, a service that
|
||||
provides a MySQL database needs to provide the username, password, and database name to consumers
|
||||
so that they can authenticate and use the correct database. Containers in pods consuming the MySQL
|
||||
service would also consume the secrets associated with the MySQL service.
|
||||
|
||||
### Use-Case: Secrets associated with service accounts
|
||||
|
||||
[Service Accounts](service_accounts.html) are proposed as a
|
||||
mechanism to decouple capabilities and security contexts from individual human users. A
|
||||
`ServiceAccount` contains references to some number of secrets. A `Pod` can specify that it is
|
||||
associated with a `ServiceAccount`. Secrets should have a `Type` field to allow the Kubelet and
|
||||
other system components to take action based on the secret's type.
|
||||
|
||||
#### Example: service account consumes auth token secret
|
||||
|
||||
As an example, the service account proposal discusses service accounts consuming secrets which
|
||||
contain Kubernetes auth tokens. When a Kubelet starts a pod associated with a service account
|
||||
which consumes this type of secret, the Kubelet may take a number of actions:
|
||||
|
||||
1. Expose the secret in a `.kubernetes_auth` file in a well-known location in the container's
|
||||
file system
|
||||
2. Configure that node's `kube-proxy` to decorate HTTP requests from that pod to the
|
||||
`kubernetes-master` service with the auth token, e.g., by adding a header to the request
|
||||
(see the [LOAS Daemon](http://issue.k8s.io/2209) proposal)
|
||||
|
||||
#### Example: service account consumes docker registry credentials
|
||||
|
||||
Another example use case is where a pod is associated with a secret containing docker registry
|
||||
credentials. The Kubelet could use these credentials for the docker pull to retrieve the image.
|
||||
|
||||
### Use-Case: Secret expiry and rotation
|
||||
|
||||
Rotation is considered a good practice for many types of secret data. It should be possible to
|
||||
express that a secret has an expiry date; this would make it possible to implement a system
|
||||
component that could regenerate expired secrets. As an example, consider a component that rotates
|
||||
expired secrets. The rotator could periodically regenerate the values for expired secrets of
|
||||
common types and update their expiry dates.
|
||||
|
||||
## Deferral: Consuming secrets as environment variables
|
||||
|
||||
Some images will expect to receive configuration items as environment variables instead of files.
|
||||
We should consider what the best way to allow this is; there are a few different options:
|
||||
|
||||
1. Force the user to adapt files into environment variables. Users can store secrets that need to
|
||||
be presented as environment variables in a format that is easy to consume from a shell:
|
||||
|
||||
$ cat /etc/secrets/my-secret.txt
|
||||
export MY_SECRET_ENV=MY_SECRET_VALUE
|
||||
|
||||
The user could `source` the file at `/etc/secrets/my-secret.txt` prior to executing the command for
|
||||
the image, either inline in the command or in an init script.
|
||||
|
||||
2. Give secrets an attribute that allows users to express the intent that the platform should
|
||||
generate the above syntax in the file used to present a secret. The user could consume these
|
||||
files in the same manner as the above option.
|
||||
|
||||
3. Give secrets attributes that allow the user to express that the secret should be presented to
|
||||
the container as an environment variable. The container's environment would contain the
|
||||
desired values, and the software in the container could use them without accommodation in the
|
||||
command or setup script.
|
||||
|
||||
For our initial work, we will treat all secrets as files to narrow the problem space. There will
|
||||
be a future proposal that handles exposing secrets as environment variables.
|
||||
|
||||
## Flow analysis of secret data with respect to the API server
|
||||
|
||||
There are two fundamentally different use-cases for access to secrets:
|
||||
|
||||
1. CRUD operations on secrets by their owners
|
||||
2. Read-only access to the secrets needed for a particular node by the kubelet
|
||||
|
||||
### Use-Case: CRUD operations by owners
|
||||
|
||||
In use cases for CRUD operations, the user experience for secrets should be no different than for
|
||||
other API resources.
|
||||
|
||||
#### Data store backing the REST API
|
||||
|
||||
The data store backing the REST API should be pluggable because different cluster operators will
|
||||
have different preferences for the central store of secret data. Some possibilities for storage:
|
||||
|
||||
1. An etcd collection alongside the storage for other API resources
|
||||
2. A collocated [HSM](http://en.wikipedia.org/wiki/Hardware_security_module)
|
||||
3. A secrets server like [Vault](https://www.vaultproject.io/) or [Keywhiz](https://square.github.io/keywhiz/)
|
||||
4. An external datastore such as an external etcd, RDBMS, etc.
|
||||
|
||||
#### Size limit for secrets
|
||||
|
||||
There should be a size limit for secrets in order to:
|
||||
|
||||
1. Prevent DOS attacks against the API server
|
||||
2. Allow kubelet implementations that prevent secret data from touching the node's filesystem
|
||||
|
||||
The size limit should satisfy the following conditions:
|
||||
|
||||
1. Large enough to store common artifact types (encryption keypairs, certificates, small
|
||||
configuration files)
|
||||
2. Small enough to avoid large impact on node resource consumption (storage, RAM for tmpfs, etc)
|
||||
|
||||
To begin discussion, we propose an initial value for this size limit of **1MB**.
|
||||
|
||||
#### Other limitations on secrets
|
||||
|
||||
Defining a policy for limitations on how a secret may be referenced by another API resource and how
|
||||
constraints should be applied throughout the cluster is tricky due to the number of variables
|
||||
involved:
|
||||
|
||||
1. Should there be a maximum number of secrets a pod can reference via a volume?
|
||||
2. Should there be a maximum number of secrets a service account can reference?
|
||||
3. Should there be a total maximum number of secrets a pod can reference via its own spec and its
|
||||
associated service account?
|
||||
4. Should there be a total size limit on the amount of secret data consumed by a pod?
|
||||
5. How will cluster operators want to be able to configure these limits?
|
||||
6. How will these limits impact API server validations?
|
||||
7. How will these limits affect scheduling?
|
||||
|
||||
For now, we will not implement validations around these limits. Cluster operators will decide how
|
||||
much node storage is allocated to secrets. It will be the operator's responsibility to ensure that
|
||||
the allocated storage is sufficient for the workload scheduled onto a node.
|
||||
|
||||
For now, kubelets will only attach secrets to api-sourced pods, and not file- or http-sourced
|
||||
ones. Doing so would:
|
||||
- confuse the secrets admission controller in the case of mirror pods.
|
||||
- create an apiserver-liveness dependency -- avoiding this dependency is a main reason to use non-api-source pods.
|
||||
|
||||
### Use-Case: Kubelet read of secrets for node
|
||||
|
||||
The use-case where the kubelet reads secrets has several additional requirements:
|
||||
|
||||
1. Kubelets should only be able to receive secret data which is required by pods scheduled onto
|
||||
the kubelet's node
|
||||
2. Kubelets should have read-only access to secret data
|
||||
3. Secret data should not be transmitted over the wire insecurely
|
||||
4. Kubelets must ensure pods do not have access to each other's secrets
|
||||
|
||||
#### Read of secret data by the Kubelet
|
||||
|
||||
The Kubelet should only be allowed to read secrets which are consumed by pods scheduled onto that
|
||||
Kubelet's node and their associated service accounts. Authorization of the Kubelet to read this
|
||||
data would be delegated to an authorization plugin and associated policy rule.
|
||||
|
||||
#### Secret data on the node: data at rest
|
||||
|
||||
Consideration must be given to whether secret data should be allowed to be at rest on the node:
|
||||
|
||||
1. If secret data is not allowed to be at rest, the size of secret data becomes another draw on
|
||||
the node's RAM - should it affect scheduling?
|
||||
2. If secret data is allowed to be at rest, should it be encrypted?
|
||||
1. If so, how should this be done?
|
||||
2. If not, what threats exist? What types of secret are appropriate to store this way?
|
||||
|
||||
For the sake of limiting complexity, we propose that initially secret data should not be allowed
|
||||
to be at rest on a node; secret data should be stored on a node-level tmpfs filesystem. This
|
||||
filesystem can be subdivided into directories for use by the kubelet and by the volume plugin.
|
||||
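A hedged sketch of such a node-level tmpfs is shown below; the mount point and size are placeholders rather than paths the kubelet actually uses:

{% highlight sh %}
{% raw %}
# Illustrative only: a RAM-backed filesystem keeps secret data off disk.
# The path and 64m size are placeholders; the kubelet would manage its own mount.
mkdir -p /var/lib/kubelet/secrets
mount -t tmpfs -o size=64m,mode=0700 tmpfs /var/lib/kubelet/secrets
{% endraw %}
{% endhighlight %}
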
|
||||
#### Secret data on the node: resource consumption
|
||||
|
||||
The Kubelet will be responsible for creating the per-node tmpfs file system for secret storage.
|
||||
It is hard to make a prescriptive declaration about how much storage is appropriate to reserve for
|
||||
secrets because different installations will vary widely in available resources, desired pod to
|
||||
node density, overcommit policy, and other operation dimensions. That being the case, we propose
|
||||
for simplicity that the amount of secret storage be controlled by a new parameter to the kubelet
|
||||
with a default value of **64MB**. It is the cluster operator's responsibility to handle choosing
|
||||
the right storage size for their installation and configuring their Kubelets correctly.
|
||||
|
||||
Configuring each Kubelet is not the ideal story for operator experience; it is more intuitive that
|
||||
the cluster-wide storage size be readable from a central configuration store like the one proposed
|
||||
in [#1553](http://issue.k8s.io/1553). When such a store
|
||||
exists, the Kubelet could be modified to read this configuration item from the store.
|
||||
|
||||
When the Kubelet is modified to advertise node resources (as proposed in
|
||||
[#4441](http://issue.k8s.io/4441)), the capacity calculation
|
||||
for available memory should factor in the potential size of the node-level tmpfs in order to avoid
|
||||
memory overcommit on the node.
|
||||
|
||||
#### Secret data on the node: isolation
|
||||
|
||||
Every pod will have a [security context](security_context.html).
|
||||
Secret data on the node should be isolated according to the security context of the container. The
|
||||
Kubelet volume plugin API will be changed so that a volume plugin receives the security context of
|
||||
a volume along with the volume spec. This will allow volume plugins to implement setting the
|
||||
security context of volumes they manage.
|
||||
|
||||
## Community work
|
||||
|
||||
Several proposals / upstream patches are notable as background for this proposal:
|
||||
|
||||
1. [Docker vault proposal](https://github.com/docker/docker/issues/10310)
|
||||
2. [Specification for image/container standardization based on volumes](https://github.com/docker/docker/issues/9277)
|
||||
3. [Kubernetes service account proposal](service_accounts.html)
|
||||
4. [Secrets proposal for docker (1)](https://github.com/docker/docker/pull/6075)
|
||||
5. [Secrets proposal for docker (2)](https://github.com/docker/docker/pull/6697)
|
||||
|
||||
## Proposed Design
|
||||
|
||||
We propose a new `Secret` resource which is mounted into containers with a new volume type. Secret
|
||||
volumes will be handled by a volume plugin that does the actual work of fetching the secret and
|
||||
storing it. Secrets contain multiple pieces of data that are presented as different files within
|
||||
the secret volume (example: SSH key pair).
|
||||
|
||||
In order to remove the burden from the end user in specifying every file that a secret consists of,
|
||||
it should be possible to mount all files provided by a secret with a single `VolumeMount` entry
|
||||
in the container specification.
|
||||
|
||||
### Secret API Resource
|
||||
|
||||
A new resource for secrets will be added to the API:
|
||||
|
||||
{% highlight go %}
|
||||
{% raw %}
|
||||
type Secret struct {
|
||||
TypeMeta
|
||||
ObjectMeta
|
||||
|
||||
// Data contains the secret data. Each key must be a valid DNS_SUBDOMAIN.
|
||||
// The serialized form of the secret data is a base64 encoded string,
|
||||
// representing the arbitrary (possibly non-string) data value here.
|
||||
Data map[string][]byte `json:"data,omitempty"`
|
||||
|
||||
// Used to facilitate programmatic handling of secret data.
|
||||
Type SecretType `json:"type,omitempty"`
|
||||
}
|
||||
|
||||
type SecretType string
|
||||
|
||||
const (
|
||||
SecretTypeOpaque SecretType = "Opaque" // Opaque (arbitrary data; default)
|
||||
SecretTypeServiceAccountToken SecretType = "kubernetes.io/service-account-token" // Kubernetes auth token
|
||||
SecretTypeDockercfg SecretType = "kubernetes.io/dockercfg" // Docker registry auth
|
||||
// FUTURE: other type values
|
||||
)
|
||||
|
||||
const MaxSecretSize = 1 * 1024 * 1024
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
A Secret can declare a type in order to provide type information to system components that work
|
||||
with secrets. The default type is `Opaque`, which represents arbitrary user-owned data.
|
||||
|
||||
Secrets are validated against `MaxSecretSize`. The keys in the `Data` field must be valid DNS
|
||||
subdomains.
|
||||
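A minimal sketch of that validation is shown below; the regular expression is a simplified stand-in for the real DNS subdomain rules and the error messages are illustrative.

{% highlight go %}
{% raw %}
// Illustrative validation: bound the total payload by MaxSecretSize and require
// every key to look like a DNS subdomain (simplified pattern).
package validation

import (
	"fmt"
	"regexp"
)

const MaxSecretSize = 1 * 1024 * 1024

var dnsSubdomain = regexp.MustCompile(`^[a-z0-9]([-a-z0-9.]*[a-z0-9])?$`)

func ValidateSecretData(data map[string][]byte) error {
	totalSize := 0
	for key, value := range data {
		if len(key) > 253 || !dnsSubdomain.MatchString(key) {
			return fmt.Errorf("key %q is not a valid DNS subdomain", key)
		}
		totalSize += len(value)
	}
	if totalSize > MaxSecretSize {
		return fmt.Errorf("secret data exceeds %d bytes", MaxSecretSize)
	}
	return nil
}
{% endraw %}
{% endhighlight %}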
|
||||
A new REST API and registry interface will be added to accompany the `Secret` resource. The
|
||||
default implementation of the registry will store `Secret` information in etcd. Future registry
|
||||
implementations could store the `TypeMeta` and `ObjectMeta` fields in etcd and store the secret
|
||||
data in another data store entirely, or store the whole object in another data store.
|
||||
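For illustration, the registry interface might look roughly as follows; the method set mirrors the pattern used by other resources and is an assumption, not part of this proposal.

{% highlight go %}
{% raw %}
// Hypothetical registry interface for secrets; the default implementation would
// be backed by etcd, but nothing below is prescribed by this proposal.
type Registry interface {
	ListSecrets(ctx api.Context, namespace string) (*api.SecretList, error)
	GetSecret(ctx api.Context, namespace, name string) (*api.Secret, error)
	CreateSecret(ctx api.Context, secret *api.Secret) error
	UpdateSecret(ctx api.Context, secret *api.Secret) error
	DeleteSecret(ctx api.Context, namespace, name string) error
}
{% endraw %}
{% endhighlight %}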
|
||||
#### Other validations related to secrets
|
||||
|
||||
Initially there will be no validations for the number of secrets a pod references, or the number of
|
||||
secrets that can be associated with a service account. These may be added in the future as the
|
||||
finer points of secrets and resource allocation are fleshed out.
|
||||
|
||||
### Secret Volume Source
|
||||
|
||||
A new `SecretSource` type of volume source will be added to the `VolumeSource` struct in the
|
||||
API:
|
||||
|
||||
{% highlight go %}
|
||||
{% raw %}
|
||||
type VolumeSource struct {
|
||||
// Other fields omitted
|
||||
|
||||
// SecretSource represents a secret that should be presented in a volume
|
||||
SecretSource *SecretSource `json:"secret"`
|
||||
}
|
||||
|
||||
type SecretSource struct {
|
||||
Target ObjectReference
|
||||
}
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
Secret volume sources are validated to ensure that the specified object reference actually points
|
||||
to an object of type `Secret`.
|
||||
|
||||
In the future, the `SecretSource` will be extended to allow (see the sketch after this list):
|
||||
|
||||
1. Fine-grained control over which pieces of secret data are exposed in the volume
|
||||
2. The paths and filenames for how secret data are exposed
|
||||
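Purely as an illustration of what those extensions could look like, the sketch below adds a hypothetical `Items` field that selects keys and maps them to paths; none of these names are settled.

{% highlight go %}
{% raw %}
// Hypothetical future shape of SecretSource: selecting individual keys and
// choosing the file path each one is exposed under. Field names are illustrative.
type SecretSource struct {
	Target ObjectReference

	// Items, if set, limits the volume to the listed keys and controls
	// the relative path each key is projected to.
	Items []KeyToPath
}

type KeyToPath struct {
	Key  string
	Path string
}
{% endraw %}
{% endhighlight %}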
|
||||
### Secret Volume Plugin
|
||||
|
||||
A new Kubelet volume plugin will be added to handle volumes with a secret source. This plugin will
|
||||
require access to the API server to retrieve secret data and therefore the volume `Host` interface
|
||||
will have to change to expose a client interface:
|
||||
|
||||
{% highlight go %}
|
||||
{% raw %}
|
||||
type Host interface {
|
||||
// Other methods omitted
|
||||
|
||||
// GetKubeClient returns a client interface
|
||||
GetKubeClient() client.Interface
|
||||
}
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
The secret volume plugin will be responsible for the following (a rough sketch follows the list):
|
||||
|
||||
1. Returning a `volume.Builder` implementation from `NewBuilder` that:
|
||||
1. Retrieves the secret data for the volume from the API server
|
||||
2. Places the secret data onto the container's filesystem
|
||||
3. Sets the correct security attributes for the volume based on the pod's `SecurityContext`
|
||||
2. Returning a `volume.Cleaner` implementation from `NewCleaner` that cleans the volume from the
|
||||
container's filesystem
|
||||
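The sketch below shows roughly what the builder's setup path could look like, assuming a client method that fetches a secret by namespace and name and a per-pod directory on the secret tmpfs; every name here is illustrative.

{% highlight go %}
{% raw %}
// Illustrative builder setup: fetch the secret through the host's client and
// project each key to a file inside the volume directory on the secret tmpfs.
// The getter interface and directory layout are assumptions, not the real plugin.
package secretvolume

import (
	"os"
	"path/filepath"
)

type secretGetter interface {
	GetSecret(namespace, name string) (map[string][]byte, error)
}

type builder struct {
	client     secretGetter
	namespace  string
	secretName string
	volumeDir  string // e.g. a per-pod directory under the node-level tmpfs
}

func (b *builder) SetUp() error {
	data, err := b.client.GetSecret(b.namespace, b.secretName)
	if err != nil {
		return err
	}
	if err := os.MkdirAll(b.volumeDir, 0700); err != nil {
		return err
	}
	for key, value := range data {
		if err := os.WriteFile(filepath.Join(b.volumeDir, key), value, 0400); err != nil {
			return err
		}
	}
	return nil
}
{% endraw %}
{% endhighlight %}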
|
||||
### Kubelet: Node-level secret storage
|
||||
|
||||
The Kubelet must be modified to accept a new parameter for the secret storage size and to create
|
||||
a tmpfs file system of that size to store secret data. Rough accounting of specific changes:
|
||||
|
||||
1. The Kubelet should have a new field added called `secretStorageSize`; units are megabytes
|
||||
2. `NewMainKubelet` should accept a value for secret storage size
|
||||
3. The Kubelet server should have a new flag added for secret storage size
|
||||
4. The Kubelet's `setupDataDirs` method should be changed to create the secret storage
|
||||
|
||||
### Kubelet: New behaviors for secrets associated with service accounts
|
||||
|
||||
For use-cases where the Kubelet's behavior is affected by the secrets associated with a pod's
|
||||
`ServiceAccount`, the Kubelet will need to be changed. For example, if secrets of type
|
||||
`docker-reg-auth` affect how the pod's images are pulled, the Kubelet will need to be changed
|
||||
to accommodate this. Subsequent proposals can address this on a type-by-type basis.
|
||||
|
||||
## Examples
|
||||
|
||||
For clarity, let's examine detailed examples of some common use-cases in terms of the
|
||||
suggested changes. All of these examples are assumed to be created in a namespace called
|
||||
`example`.
|
||||
|
||||
### Use-Case: Pod with ssh keys
|
||||
|
||||
To create a pod that uses an ssh key stored as a secret, we first need to create a secret:
|
||||
|
||||
{% highlight json %}
|
||||
{% raw %}
|
||||
{
|
||||
"kind": "Secret",
|
||||
"apiVersion": "v1",
|
||||
"metadata": {
|
||||
"name": "ssh-key-secret"
|
||||
},
|
||||
"data": {
|
||||
"id-rsa": "dmFsdWUtMg0KDQo=",
|
||||
"id-rsa.pub": "dmFsdWUtMQ0K"
|
||||
}
|
||||
}
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
**Note:** The serialized JSON and YAML values of secret data are encoded as
|
||||
base64 strings. Newlines are not valid within these strings and must be
|
||||
omitted.
|
||||
|
||||
Now we can create a pod which references the secret with the ssh key and consumes it in a volume:
|
||||
|
||||
{% highlight json %}
|
||||
{% raw %}
|
||||
{
|
||||
"kind": "Pod",
|
||||
"apiVersion": "v1",
|
||||
"metadata": {
|
||||
"name": "secret-test-pod",
|
||||
"labels": {
|
||||
"name": "secret-test"
|
||||
}
|
||||
},
|
||||
"spec": {
|
||||
"volumes": [
|
||||
{
|
||||
"name": "secret-volume",
|
||||
"secret": {
|
||||
"secretName": "ssh-key-secret"
|
||||
}
|
||||
}
|
||||
],
|
||||
"containers": [
|
||||
{
|
||||
"name": "ssh-test-container",
|
||||
"image": "mySshImage",
|
||||
"volumeMounts": [
|
||||
{
|
||||
"name": "secret-volume",
|
||||
"readOnly": true,
|
||||
"mountPath": "/etc/secret-volume"
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
When the container's command runs, the pieces of the key will be available in:
|
||||
|
||||
/etc/secret-volume/id-rsa.pub
|
||||
/etc/secret-volume/id-rsa
|
||||
|
||||
The container is then free to use the secret data to establish an ssh connection.
|
||||
|
||||
### Use-Case: Pods with prod / test credentials
|
||||
|
||||
This example illustrates a pod which consumes a secret containing prod
|
||||
credentials and another pod which consumes a secret with test environment
|
||||
credentials.
|
||||
|
||||
The secrets:
|
||||
|
||||
{% highlight json %}
|
||||
{% raw %}
|
||||
{
|
||||
"apiVersion": "v1",
|
||||
"kind": "List",
|
||||
"items":
|
||||
[{
|
||||
"kind": "Secret",
|
||||
"apiVersion": "v1",
|
||||
"metadata": {
|
||||
"name": "prod-db-secret"
|
||||
},
|
||||
"data": {
|
||||
"password": "dmFsdWUtMg0KDQo=",
|
||||
"username": "dmFsdWUtMQ0K"
|
||||
}
|
||||
},
|
||||
{
|
||||
"kind": "Secret",
|
||||
"apiVersion": "v1",
|
||||
"metadata": {
|
||||
"name": "test-db-secret"
|
||||
},
|
||||
"data": {
|
||||
"password": "dmFsdWUtMg0KDQo=",
|
||||
"username": "dmFsdWUtMQ0K"
|
||||
}
|
||||
}]
|
||||
}
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
The pods:
|
||||
|
||||
{% highlight json %}
|
||||
{% raw %}
|
||||
{
|
||||
"apiVersion": "v1",
|
||||
"kind": "List",
|
||||
"items":
|
||||
[{
|
||||
"kind": "Pod",
|
||||
"apiVersion": "v1",
|
||||
"metadata": {
|
||||
"name": "prod-db-client-pod",
|
||||
"labels": {
|
||||
"name": "prod-db-client"
|
||||
}
|
||||
},
|
||||
"spec": {
|
||||
"volumes": [
|
||||
{
|
||||
"name": "secret-volume",
|
||||
"secret": {
|
||||
"secretName": "prod-db-secret"
|
||||
}
|
||||
}
|
||||
],
|
||||
"containers": [
|
||||
{
|
||||
"name": "db-client-container",
|
||||
"image": "myClientImage",
|
||||
"volumeMounts": [
|
||||
{
|
||||
"name": "secret-volume",
|
||||
"readOnly": true,
|
||||
"mountPath": "/etc/secret-volume"
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
},
|
||||
{
|
||||
"kind": "Pod",
|
||||
"apiVersion": "v1",
|
||||
"metadata": {
|
||||
"name": "test-db-client-pod",
|
||||
"labels": {
|
||||
"name": "test-db-client"
|
||||
}
|
||||
},
|
||||
"spec": {
|
||||
"volumes": [
|
||||
{
|
||||
"name": "secret-volume",
|
||||
"secret": {
|
||||
"secretName": "test-db-secret"
|
||||
}
|
||||
}
|
||||
],
|
||||
"containers": [
|
||||
{
|
||||
"name": "db-client-container",
|
||||
"image": "myClientImage",
|
||||
"volumeMounts": [
|
||||
{
|
||||
"name": "secret-volume",
|
||||
"readOnly": true,
|
||||
"mountPath": "/etc/secret-volume"
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
}]
|
||||
}
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
The specs for the two pods differ only in the value of the object referred to by the secret volume
|
||||
source. Both containers will have the following files present on their filesystems:
|
||||
|
||||
/etc/secret-volume/username
|
||||
/etc/secret-volume/password
|
||||
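For example, application code in either container can read the credentials from those paths without knowing which secret backs the volume; a minimal sketch:

{% highlight go %}
{% raw %}
// Minimal sketch: read the projected credential files; the same code works in
// both the prod and test pods because only the referenced secret differs.
package main

import (
	"fmt"
	"log"
	"os"
	"strings"
)

func main() {
	username, err := os.ReadFile("/etc/secret-volume/username")
	if err != nil {
		log.Fatal(err)
	}
	password, err := os.ReadFile("/etc/secret-volume/password")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("connecting as %s\n", strings.TrimSpace(string(username)))
	_ = password // hand off to the database client of your choice
}
{% endraw %}
{% endhighlight %}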
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/secrets.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,139 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "Security in Kubernetes"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
# Security in Kubernetes
|
||||
|
||||
Kubernetes should define a reasonable set of security best practices that allows processes to be isolated from each other and from the cluster infrastructure, and that preserves important boundaries between those who manage the cluster and those who use it.
|
||||
|
||||
While Kubernetes today is not primarily a multi-tenant system, the long term evolution of Kubernetes will increasingly rely on proper boundaries between users and administrators. The code running on the cluster must be appropriately isolated and secured to prevent malicious parties from affecting the entire cluster.
|
||||
|
||||
|
||||
## High Level Goals
|
||||
|
||||
1. Ensure a clear isolation between the container and the underlying host it runs on
|
||||
2. Limit the ability of the container to negatively impact the infrastructure or other containers
|
||||
3. [Principle of Least Privilege](http://en.wikipedia.org/wiki/Principle_of_least_privilege) - ensure components are only authorized to perform the actions they need, and limit the scope of a compromise by limiting the capabilities of individual components
|
||||
4. Reduce the number of systems that have to be hardened and secured by defining clear boundaries between components
|
||||
5. Allow users of the system to be cleanly separated from administrators
|
||||
6. Allow administrative functions to be delegated to users where necessary
|
||||
7. Allow applications to be run on the cluster that have "secret" data (keys, certs, passwords) which is properly abstracted from "public" data.
|
||||
|
||||
|
||||
## Use cases
|
||||
|
||||
### Roles
|
||||
|
||||
We define "user" as a unique identity accessing the Kubernetes API server, which may be a human or an automated process. Human users fall into the following categories:
|
||||
|
||||
1. k8s admin - administers a Kubernetes cluster and has access to the underlying components of the system
|
||||
2. k8s project administrator - administers the security of a small subset of the cluster
|
||||
3. k8s developer - launches pods on a Kubernetes cluster and consumes cluster resources
|
||||
|
||||
Automated process users fall into the following categories:
|
||||
|
||||
1. k8s container user - the identity that processes running inside a container (on the cluster) use to access other cluster resources, independent of the human users attached to a project
|
||||
2. k8s infrastructure user - the user that Kubernetes infrastructure components use to perform cluster functions with clearly defined roles
|
||||
|
||||
|
||||
### Description of roles
|
||||
|
||||
* Developers:
|
||||
* write pod specs.
|
||||
* make some of their own images, and use some "community" docker images
|
||||
* know which pods need to talk to which other pods
|
||||
* decide which pods should share files with other pods, and which should not.
|
||||
* reason about application level security, such as containing the effects of a local-file-read exploit in a webserver pod.
|
||||
* do not often reason about operating system or organizational security.
|
||||
* are not necessarily comfortable reasoning about the security properties of a system at the level of detail of Linux Capabilities, SELinux, AppArmor, etc.
|
||||
|
||||
* Project Admins:
|
||||
* allocate identity and roles within a namespace
|
||||
* reason about organizational security within a namespace
|
||||
* don't give a developer permissions that are not needed for their role.
|
||||
* protect files on shared storage from unnecessary cross-team access
|
||||
* are less focused on application security
|
||||
|
||||
* Administrators:
|
||||
* are less focused on application security and more focused on operating system security.
|
||||
* protect the node from bad actors in containers, and properly-configured innocent containers from bad actors in other containers.
|
||||
* are comfortable reasoning about the security properties of a system at the level of detail of Linux Capabilities, SELinux, AppArmor, etc.
|
||||
* decide who can use which Linux Capabilities, run privileged containers, use hostPath, etc.
|
||||
* e.g. a team that manages Ceph or a mysql server might be trusted to have raw access to storage devices in some organizations, but teams that develop the applications at higher layers would not.
|
||||
|
||||
|
||||
## Proposed Design
|
||||
|
||||
A pod runs in a *security context* under a *service account* that is defined by an administrator or project administrator, and the *secrets* a pod has access to are limited by that *service account*.
|
||||
|
||||
|
||||
1. The API should authenticate and authorize user actions [authn and authz](access.html)
|
||||
2. All infrastructure components (kubelets, kube-proxies, controllers, scheduler) should have an infrastructure user that they can authenticate with and be authorized to perform only the functions they require against the API.
|
||||
3. Most infrastructure components should use the API as a way of exchanging data and changing the system, and only the API should have access to the underlying data store (etcd)
|
||||
4. When containers run on the cluster and need to talk to other containers or the API server, they should be identified and authorized clearly as an autonomous process via a [service account](service_accounts.html)
|
||||
1. If the user who started a long-lived process is removed from access to the cluster, the process should be able to continue without interruption
|
||||
2. If a user who started processes is removed from the cluster, administrators may wish to terminate those processes in bulk
|
||||
3. When containers run with a service account, the user that created / triggered the service account behavior must be associated with the container's action
|
||||
5. When container processes run on the cluster, they should run in a [security context](security_context.html) that isolates those processes via Linux user security, user namespaces, and permissions.
|
||||
1. Administrators should be able to configure the cluster to automatically confine all container processes to a non-root, randomly assigned UID
|
||||
2. Administrators should be able to ensure that container processes within the same namespace are all assigned the same Unix UID
|
||||
3. Administrators should be able to limit which developers and project administrators have access to higher privilege actions
|
||||
4. Project administrators should be able to run pods within a namespace under different security contexts, and developers must be able to specify which of the available security contexts they may use
|
||||
5. Developers should be able to run their own images or images from the community and expect those images to run correctly
|
||||
6. Developers may need to ensure their images work within higher security requirements specified by administrators
|
||||
7. When available, Linux kernel user namespaces can be used to ensure 5.2 and 5.4 are met.
|
||||
8. When application developers want to share filesystem data via distributed filesystems, the Unix user ids on those filesystems must be consistent across different container processes
|
||||
6. Developers should be able to define [secrets](secrets.html) that are automatically added to the containers when pods are run
|
||||
1. Secrets are files injected into the container whose values should not be displayed within a pod. Examples:
|
||||
1. An SSH private key for git cloning remote data
|
||||
2. A client certificate for accessing a remote system
|
||||
3. A private key and certificate for a web server
|
||||
4. A .kubeconfig file with embedded cert / token data for accessing the Kubernetes master
|
||||
5. A .dockercfg file for pulling images from a protected registry
|
||||
2. Developers should be able to define the pod spec so that a secret lands in a specific location
|
||||
3. Project administrators should be able to limit developers within a namespace from viewing or modifying secrets (anyone who can launch an arbitrary pod can view secrets)
|
||||
4. Secrets are generally not copied from one namespace to another when a developer's application definitions are copied
|
||||
|
||||
|
||||
### Related design discussion
|
||||
|
||||
* [Authorization and authentication](access.html)
|
||||
* [Secret distribution via files](http://pr.k8s.io/2030)
|
||||
* [Docker secrets](https://github.com/docker/docker/pull/6697)
|
||||
* [Docker vault](https://github.com/docker/docker/issues/10310)
|
||||
* [Service accounts](service_accounts.html)
|
||||
* [Secret volumes](http://pr.k8s.io/4126)
|
||||
|
||||
## Specific Design Points
|
||||
|
||||
### TODO: authorization, authentication
|
||||
|
||||
### Isolate the data store from the nodes and supporting infrastructure
|
||||
|
||||
Access to the central data store (etcd) in Kubernetes allows an attacker to run arbitrary containers on hosts, to gain access to any protected information stored either in volumes or in pods (such as access tokens or shared secrets provided as environment variables), to intercept and redirect traffic from running services by inserting middlemen, or to simply delete the entire history of the cluster.
|
||||
|
||||
As a general principle, access to the central data store should be restricted to the components that need full control over the system and which can apply appropriate authorization and authentication of change requests. In the future, etcd may offer granular access control, but that granularity will require an administrator to understand the schema of the data to properly apply security. An administrator must be able to properly secure Kubernetes at a policy level, rather than at an implementation level, and schema changes over time should not risk unintended security leaks.
|
||||
|
||||
Both the Kubelet and Kube Proxy need information related to their specific roles - for the Kubelet, the set of pods it should be running, and for the Proxy, the set of services and endpoints to load balance. The Kubelet also needs to provide information about running pods and historical termination data. The access pattern for both Kubelet and Proxy to load their configuration is an efficient "wait for changes" request over HTTP. It should be possible to limit the Kubelet and Proxy to only access the information they need to perform their roles and no more.
|
||||
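As a hedged sketch of that access pattern, the snippet below issues one long-lived HTTP request and streams change notifications line by line; the endpoint and query parameter are placeholders rather than the actual API.

{% highlight go %}
{% raw %}
// Illustrative "wait for changes" loop: a single long-lived HTTP request whose
// response body streams one change notification per line. The endpoint shown
// is a placeholder, not the actual API path.
package main

import (
	"bufio"
	"log"
	"net/http"
)

func watchPods(apiServer string) error {
	resp, err := http.Get(apiServer + "/api/v1/pods?watch=true")
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		log.Printf("change event: %s", scanner.Text())
	}
	return scanner.Err()
}

func main() {
	if err := watchPods("http://localhost:8080"); err != nil {
		log.Fatal(err)
	}
}
{% endraw %}
{% endhighlight %}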
|
||||
The controller manager for Replication Controllers and other future controllers acts on behalf of a user via delegation to perform automated maintenance on Kubernetes resources. Its ability to access or modify resource state should be strictly limited to its intended duties, and it should be prevented from accessing information not pertinent to its role. For example, a replication controller needs only to create a copy of a known pod configuration, to determine the running state of an existing pod, or to delete an existing pod that it created - it does not need to know the contents or current state of a pod, nor have access to any data in the pod's attached volumes.
|
||||
|
||||
The Kubernetes pod scheduler is responsible for reading data from the pod to fit it onto a node in the cluster. At a minimum, it needs access to view the ID of a pod (to craft the binding), its current state, any resource information necessary to identify placement, and other data relevant to concerns like anti-affinity, zone or region preference, or custom logic. It does not need the ability to modify pods or see other resources, only to create bindings. It should not need the ability to delete bindings unless the scheduler takes control of relocating components on failed hosts (which could be implemented by a separate component that can delete bindings but not create them). The scheduler may need read access to user or project-container information to determine preferential location (underspecified at this time).
|
||||
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/security.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|
|
@ -0,0 +1,188 @@
|
|||
---
|
||||
layout: docwithnav
|
||||
title: "Security Contexts"
|
||||
---
|
||||
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
|
||||
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||
|
||||
# Security Contexts
|
||||
|
||||
## Abstract
|
||||
|
||||
A security context is a set of constraints that are applied to a container in order to achieve the following goals (from [security design](security.html)):
|
||||
|
||||
1. Ensure a clear isolation between container and the underlying host it runs on
|
||||
2. Limit the ability of the container to negatively impact the infrastructure or other containers
|
||||
|
||||
## Background
|
||||
|
||||
The problem of securing containers in Kubernetes has come up [before](http://issue.k8s.io/398) and the potential problems with container security are [well known](http://opensource.com/business/14/7/docker-security-selinux). Although it is not possible to completely isolate Docker containers from their hosts, new features like [user namespaces](https://github.com/docker/libcontainer/pull/304) make it possible to greatly reduce the attack surface.
|
||||
|
||||
## Motivation
|
||||
|
||||
### Container isolation
|
||||
|
||||
In order to improve container isolation from host and other containers running on the host, containers should only be
|
||||
granted the access they need to perform their work. To this end it should be possible to take advantage of Docker
|
||||
features such as the ability to [add or remove capabilities](https://docs.docker.com/reference/run/#runtime-privilege-linux-capabilities-and-lxc-configuration) and [assign MCS labels](https://docs.docker.com/reference/run/#security-configuration)
|
||||
to the container process.
|
||||
|
||||
Support for user namespaces has recently been [merged](https://github.com/docker/libcontainer/pull/304) into Docker's libcontainer project and should soon surface in Docker itself. It will make it possible to assign a range of unprivileged uids and gids from the host to each container, improving the isolation between host and container and between containers.
|
||||
|
||||
### External integration with shared storage
|
||||
|
||||
In order to support external integration with shared storage, processes running in a Kubernetes cluster
|
||||
should be able to be uniquely identified by their Unix UID, such that a chain of ownership can be established.
|
||||
Processes in pods will need to have consistent UID/GID/SELinux category labels in order to access shared disks.
|
||||
|
||||
## Constraints and Assumptions
|
||||
|
||||
* It is out of the scope of this document to prescribe a specific set
|
||||
of constraints to isolate containers from their host. Different use cases need different
|
||||
settings.
|
||||
* The concept of a security context should not be tied to a particular security mechanism or platform
|
||||
(e.g., SELinux, AppArmor)
|
||||
* Applying a different security context to a scope (namespace or pod) requires a solution such as the one proposed for
|
||||
[service accounts](service_accounts.html).
|
||||
|
||||
## Use Cases
|
||||
|
||||
In order of increasing complexity, the following are example use cases that would
|
||||
be addressed with security contexts:
|
||||
|
||||
1. Kubernetes is used to run a single cloud application. In order to protect
|
||||
nodes from containers:
|
||||
* All containers run as a single non-root user
|
||||
* Privileged containers are disabled
|
||||
* All containers run with a particular MCS label
|
||||
* Kernel capabilities like CHOWN and MKNOD are removed from containers
|
||||
|
||||
2. Just like case #1, except that I have more than one application running on
|
||||
the Kubernetes cluster.
|
||||
* Each application is run in its own namespace to avoid name collisions
|
||||
* For each application a different uid and MCS label is used
|
||||
|
||||
3. Kubernetes is used as the base for a PAAS with
|
||||
multiple projects, each project represented by a namespace.
|
||||
* Each namespace is associated with a range of uids/gids on the node that
|
||||
are mapped to uids/gids on containers using linux user namespaces.
|
||||
* Certain pods in each namespace have special privileges to perform system
|
||||
actions such as talking back to the server for deployment, running docker
|
||||
builds, etc.
|
||||
* External NFS storage is assigned to each namespace and permissions set
|
||||
using the range of uids/gids assigned to that namespace.
|
||||
|
||||
## Proposed Design
|
||||
|
||||
### Overview
|
||||
|
||||
A *security context* consists of a set of constraints that determine how a container
|
||||
is secured before getting created and run. A security context resides on the container and represents the runtime parameters that will
|
||||
be used to create and run the container via container APIs. A *security context provider* is passed to the Kubelet so it can have a chance
|
||||
to mutate Docker API calls in order to apply the security context.
|
||||
|
||||
It is recommended that this design be implemented in two phases:
|
||||
|
||||
1. Implement the security context provider extension point in the Kubelet
|
||||
so that a default security context can be applied on container run and creation.
|
||||
2. Implement a security context structure that is part of a service account. The
|
||||
default context provider can then be used to apply a security context based
|
||||
on the service account associated with the pod.
|
||||
|
||||
### Security Context Provider
|
||||
|
||||
The Kubelet will have an interface that points to a `SecurityContextProvider`. The `SecurityContextProvider` is invoked before creating and running a given container:
|
||||
|
||||
{% highlight go %}
|
||||
{% raw %}
|
||||
type SecurityContextProvider interface {
|
||||
// ModifyContainerConfig is called before the Docker createContainer call.
|
||||
// The security context provider can make changes to the Config with which
|
||||
// the container is created.
|
||||
// An error is returned if it's not possible to secure the container as
|
||||
// requested with a security context.
|
||||
ModifyContainerConfig(pod *api.Pod, container *api.Container, config *docker.Config)
|
||||
|
||||
// ModifyHostConfig is called before the Docker runContainer call.
|
||||
// The security context provider can make changes to the HostConfig, affecting
|
||||
// security options, whether the container is privileged, volume binds, etc.
|
||||
// An error is returned if it's not possible to secure the container as requested
|
||||
// with a security context.
|
||||
ModifyHostConfig(pod *api.Pod, container *api.Container, hostConfig *docker.HostConfig)
|
||||
}
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
If the value of the SecurityContextProvider field on the Kubelet is nil, the kubelet will create and run the container as it does today.
|
||||
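A sketch of that call pattern is shown below; the function and field names are assumptions used to illustrate the provider acting as an optional hook.

{% highlight go %}
{% raw %}
// Illustrative call site: the provider, when present, mutates the container
// config before creation; otherwise the container is created exactly as today.
func (kl *Kubelet) applySecurityContext(pod *api.Pod, container *api.Container, config *docker.Config) {
	if kl.securityContextProvider == nil {
		return
	}
	kl.securityContextProvider.ModifyContainerConfig(pod, container, config)
}
{% endraw %}
{% endhighlight %}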
|
||||
### Security Context
|
||||
|
||||
A security context resides on the container and represents the runtime parameters that will
|
||||
be used to create and run the container via container APIs. Following is an example of an initial implementation:
|
||||
|
||||
{% highlight go %}
|
||||
{% raw %}
|
||||
type Container struct {
|
||||
... other fields omitted ...
|
||||
// Optional: SecurityContext defines the security options the pod should be run with
|
||||
SecurityContext *SecurityContext
|
||||
}
|
||||
|
||||
// SecurityContext holds security configuration that will be applied to a container. SecurityContext
|
||||
// contains duplication of some existing fields from the Container resource. These duplicate fields
|
||||
// will be populated based on the Container configuration if they are not set. Defining them on
|
||||
// both the Container AND the SecurityContext will result in an error.
|
||||
type SecurityContext struct {
|
||||
// Capabilities are the capabilities to add/drop when running the container
|
||||
Capabilities *Capabilities
|
||||
|
||||
// Run the container in privileged mode
|
||||
Privileged *bool
|
||||
|
||||
// SELinuxOptions are the labels to be applied to the container
|
||||
// and volumes
|
||||
SELinuxOptions *SELinuxOptions
|
||||
|
||||
// RunAsUser is the UID to run the entrypoint of the container process.
|
||||
RunAsUser *int64
|
||||
}
|
||||
|
||||
// SELinuxOptions are the labels to be applied to the container.
|
||||
type SELinuxOptions struct {
|
||||
// SELinux user label
|
||||
User string
|
||||
|
||||
// SELinux role label
|
||||
Role string
|
||||
|
||||
// SELinux type label
|
||||
Type string
|
||||
|
||||
// SELinux level label.
|
||||
Level string
|
||||
}
|
||||
{% endraw %}
|
||||
{% endhighlight %}
|
||||
|
||||
### Admission
|
||||
|
||||
It is up to an admission plugin to determine if the security context is acceptable or not. At the
|
||||
time of writing, the admission control plugin for security contexts will only allow a context that
|
||||
defines capabilities or privileged mode. Contexts that attempt to define a UID or SELinux options
|
||||
will be denied by default. In the future the admission plugin will base this decision upon
|
||||
configurable policies that reside within the [service account](http://pr.k8s.io/2297).
|
||||
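A minimal sketch of that initial policy is shown below; the function shape is illustrative, since the real check would live inside an admission control plugin.

{% highlight go %}
{% raw %}
// Illustrative first-cut policy: capabilities and privileged pass through,
// while UID and SELinux settings are rejected by default.
func validateSecurityContext(sc *SecurityContext) error {
	if sc == nil {
		return nil
	}
	if sc.RunAsUser != nil {
		return fmt.Errorf("security context may not set runAsUser")
	}
	if sc.SELinuxOptions != nil {
		return fmt.Errorf("security context may not set SELinux options")
	}
	// Capabilities and Privileged are permitted at this stage.
	return nil
}
{% endraw %}
{% endhighlight %}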
|
||||
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: IS_VERSIONED -->
|
||||
<!-- TAG IS_VERSIONED -->
|
||||
<!-- END MUNGE: IS_VERSIONED -->
|
||||
|
||||
|
||||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/security_context.md?pixel)]()
|
||||
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
||||
|