---
assignees:
- mikedanese
- thockin
title: Managing Compute Resources
---

* TOC
{:toc}

When specifying a [pod](/docs/user-guide/pods), you can optionally specify how much CPU and memory (RAM) each container needs. When containers have their resource requests specified, the scheduler can make better decisions about which nodes to place pods on; and when containers have their limits specified, contention for resources on a node can be handled in a specified manner. For more details about the difference between requests and limits, see [Resource QoS](https://github.com/kubernetes/kubernetes/blob/{{page.githubbranch}}/docs/design/resource-qos.md).

*CPU* and *memory* are each a *resource type*. A resource type has a base unit. CPU is specified in units of cores. Memory is specified in units of bytes.

CPU and RAM are collectively referred to as *compute resources*, or just *resources*. Compute resources are measurable quantities that can be requested, allocated, and consumed. They are distinct from [API resources](/docs/user-guide/working-with-resources). API resources, such as pods and [services](/docs/user-guide/services), are objects that can be written to and retrieved from the Kubernetes API server.

## Resource Requests and Limits of Pod and Container

Each container of a pod can optionally specify one or more of the following:

* `spec.containers[].resources.limits.cpu`
* `spec.containers[].resources.limits.memory`
* `spec.containers[].resources.requests.cpu`
* `spec.containers[].resources.requests.memory`

Specifying resource requests and/or limits is optional. In some clusters, unset limits or requests may be replaced with default values when a pod is created or updated; the default values depend on how the cluster is configured. If the requests values are not specified, they are set equal to the limits values by default. Note that limits must always be greater than or equal to requests.

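For example, the following sketch sets only limits and relies on the defaulting described above (the pod name and `nginx` image are purely illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: limits-only
spec:
  containers:
  - name: app
    image: nginx
    resources:
      limits:
        memory: "128Mi"
        cpu: "500m"
      # requests is omitted here; in clusters that apply defaulting,
      # the requests values are set equal to these limits.
```
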
Although requests/limits can only be specified on individual containers, it is convenient to talk about pod resource requests/limits. A *pod resource request/limit* for a particular resource type is the sum of the resource requests/limits of that type for each container in the pod, with unset values treated as zero (or equal to default values in some cluster configurations).

### Meaning of CPU

Limits and requests for `cpu` are measured in cpus.
One cpu, in Kubernetes, is equivalent to:

- 1 AWS vCPU
- 1 GCP Core
- 1 Azure vCore
- 1 *Hyperthread* on a bare-metal Intel processor with Hyperthreading

Fractional requests are allowed. A container with `spec.containers[].resources.requests.cpu` of `0.5` is guaranteed half as much CPU as one that asks for `1`. The expression `0.1` is equivalent to the expression `100m`, which can be read as "one hundred millicpu" (some say "one hundred millicores", which is understood to mean the same thing when talking about Kubernetes). A request with a decimal point, like `0.1`, is converted to `100m` by the API, and precision finer than `1m` is not allowed. For this reason, the form `100m` may be preferred.

CPU is always requested as an absolute quantity, never as a relative quantity; 0.1 is the same amount of cpu on a single-core, dual-core, or 48-core machine.

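For illustration, the two containers in the sketch below request the same amount of CPU, written in the two equivalent forms (the pod name and `nginx` image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cpu-units-demo
spec:
  containers:
  - name: decimal
    image: nginx
    resources:
      requests:
        cpu: "0.1"    # converted to 100m by the API
  - name: millicpu
    image: nginx
    resources:
      requests:
        cpu: "100m"   # the same amount of CPU as 0.1
```
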
### Meaning of Memory

Limits and requests for `memory` are measured in bytes.
Memory can be expressed as a plain integer or as a fixed-point integer with one of these SI suffixes (E, P, T, G, M, K) or their power-of-two equivalents (Ei, Pi, Ti, Gi, Mi, Ki). For example, the following represent roughly the same value: `128974848`, `129e6`, `129M`, `123Mi` (`123Mi` is 123 &times; 2<sup>20</sup> bytes, which is exactly `128974848`).

### Example

The following pod has two containers. Each has a request of 0.25 core of cpu and 64MiB (2<sup>26</sup> bytes) of memory, and a limit of 0.5 core of cpu and 128MiB of memory. The pod as a whole can be said to have a request of 0.5 core and 128MiB of memory, and a limit of 1 core and 256MiB of memory.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: frontend
spec:
  containers:
  - name: db
    image: mysql
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"
  - name: wp
    image: wordpress
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"
```

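Assuming this manifest is saved in a file named `frontend.yaml` (an illustrative filename), the pod can be created with `kubectl`:

```shell
$ kubectl create -f frontend.yaml
```
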
## How Pods with Resource Requests are Scheduled

When a pod is created, the Kubernetes scheduler selects a node for the pod to run on. Each node has a maximum capacity for each of the resource types: the amount of CPU and memory it can provide for pods. The scheduler ensures that, for each resource type (CPU and memory), the sum of the resource requests of the containers scheduled to the node is less than the capacity of the node. Note that even if actual memory or CPU resource usage on a node is very low, the scheduler will still refuse to place pods onto the node if the capacity check fails. This protects against a resource shortage on a node when resource usage later increases, for example during a daily peak in request rate.

## How Pods with Resource Limits are Run

When kubelet starts a container of a pod, it passes the CPU and memory limits to the container runner (Docker or rkt).

When using Docker:

- The `spec.containers[].resources.requests.cpu` is converted to its core value (potentially fractional), multiplied by 1024, and the result is used as the value of the [`--cpu-shares`](https://docs.docker.com/reference/run/#runtime-constraints-on-resources) flag to the `docker run` command.
- The `spec.containers[].resources.limits.cpu` is converted to its millicore value, multiplied by 100000, and then divided by 1000, and the result is used as the value of the [`--cpu-quota`](https://docs.docker.com/reference/run/#runtime-constraints-on-resources) flag to the `docker run` command. The `--cpu-period` flag is set to 100000, which represents the default 100ms period for measuring quota usage. The kubelet enforces cpu limits if it was started with the `--cpu-cfs-quota` flag set to true. As of Kubernetes version 1.2, this flag defaults to true. (A worked example of this arithmetic follows the list.)
- The `spec.containers[].resources.limits.memory` is converted to an integer and used as the value of the [`--memory`](https://docs.docker.com/reference/run/#runtime-constraints-on-resources) flag to the `docker run` command.

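As a worked example, consider the `db` container from the example above, with a CPU request of `250m`, a CPU limit of `500m`, and a memory limit of `128Mi`. Under the rules listed here, the kubelet would pass Docker roughly the following flags (a sketch of the conversion arithmetic, not output captured from a real kubelet):

```shell
# requests.cpu:  250m  -> 0.25 cores x 1024    -> --cpu-shares=256
# limits.cpu:    500m  -> 500 x 100000 / 1000  -> --cpu-quota=50000 (with --cpu-period=100000)
# limits.memory: 128Mi -> 134217728 bytes      -> --memory=134217728
$ docker run --cpu-shares=256 --cpu-quota=50000 --cpu-period=100000 --memory=134217728 mysql
```
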
**TODO: document behavior for rkt**

If a container exceeds its memory limit, it may be terminated. If it is restartable, the kubelet will restart it, as with any other type of runtime failure.

A container may or may not be allowed to exceed its CPU limit for extended periods of time. However, it will not be killed for excessive CPU usage.

To determine if a container cannot be scheduled or is being killed due to resource limits, see the "Troubleshooting" section below.

## Monitoring Compute Resource Usage

The resource usage of a pod is reported as part of the Pod status.

If [optional monitoring](http://releases.k8s.io/{{page.githubbranch}}/cluster/addons/cluster-monitoring/README.md) is configured for your cluster, then pod resource usage can be retrieved from the monitoring system.

## Troubleshooting

### My pods are pending with event message FailedScheduling

If the scheduler cannot find any node where a pod can fit, the pod remains unscheduled until a place can be found. An event is produced each time the scheduler fails to find a place for the pod, like this:

```shell
$ kubectl describe pod frontend | grep -A 3 Events
Events:
  FirstSeen  LastSeen  Count  From          SubobjectPath  Reason            Message
  36s        5s        6      {scheduler }                 FailedScheduling  Failed for reason PodExceedsFreeCPU and possibly others
```

In the case shown above, the pod "frontend" fails to be scheduled due to insufficient CPU resource on the node. Similar error messages can also suggest failure due to insufficient memory (PodExceedsFreeMemory). In general, if a pod is pending with a message of this type, there are several things to try:

- Add more nodes to the cluster.
- Terminate unneeded pods to make room for pending pods.
- Check that the pod is not larger than all the nodes. For example, if all the nodes have a capacity of `cpu: 1`, then a pod with a request of `cpu: 1.1` will never be scheduled.

You can check node capacities and amounts allocated with the `kubectl describe nodes` command. For example:

```shell
$ kubectl describe nodes gke-cluster-4-386701dd-node-ww4p
Name:       gke-cluster-4-386701dd-node-ww4p
[ ... lines removed for clarity ...]
Capacity:
 cpu:    1
 memory: 464Mi
 pods:   40
Allocated resources (total requests):
 cpu:    910m
 memory: 2370Mi
 pods:   4
[ ... lines removed for clarity ...]
Pods: (4 in total)
  Namespace    Name                                                     CPU(milliCPU)       Memory(bytes)
  frontend     webserver-ffj8j                                          500 (50% of total)  2097152000 (50% of total)
  kube-system  fluentd-cloud-logging-gke-cluster-4-386701dd-node-ww4p   100 (10% of total)  209715200 (5% of total)
  kube-system  kube-dns-v8-qopgw                                        310 (31% of total)  178257920 (4% of total)
TotalResourceLimits:
  CPU(milliCPU):  910 (91% of total)
  Memory(bytes):  2485125120 (59% of total)
[ ... lines removed for clarity ...]
```

Here you can see from the `Allocated resources` section that a pod that asks for more than 90 millicpus or more than 1341MiB of memory will not be able to fit on this node.

Looking at the `Pods` section, you can see which pods are taking up space on the node.

The [resource quota](/docs/admin/resourcequota/) feature can be configured to limit the total amount of resources that can be consumed. If used in conjunction with namespaces, it can prevent one team from hogging all the resources.

### My container is terminated

Your container may be terminated because it is resource-starved. To check whether a container is being killed because it is hitting a resource limit, call `kubectl describe pod` on the pod you are interested in:

```shell
[12:54:41] $ ./cluster/kubectl.sh describe pod simmemleak-hra99
Name:                           simmemleak-hra99
Namespace:                      default
Image(s):                       saadali/simmemleak
Node:                           kubernetes-node-tf0f/10.240.216.66
Labels:                         name=simmemleak
Status:                         Running
Reason:
Message:
IP:                             10.244.2.75
Replication Controllers:        simmemleak (1/1 replicas created)
Containers:
  simmemleak:
    Image:  saadali/simmemleak
    Limits:
      cpu:                      100m
      memory:                   50Mi
    State:                      Running
      Started:                  Tue, 07 Jul 2015 12:54:41 -0700
    Last Termination State:     Terminated
      Exit Code:                1
      Started:                  Fri, 07 Jul 2015 12:54:30 -0700
      Finished:                 Fri, 07 Jul 2015 12:54:33 -0700
    Ready:                      False
    Restart Count:              5
Conditions:
  Type      Status
  Ready     False
Events:
  FirstSeen                        LastSeen                         Count  From                            SubobjectPath                      Reason     Message
  Tue, 07 Jul 2015 12:53:51 -0700  Tue, 07 Jul 2015 12:53:51 -0700  1      {scheduler }                                                       scheduled  Successfully assigned simmemleak-hra99 to kubernetes-node-tf0f
  Tue, 07 Jul 2015 12:53:51 -0700  Tue, 07 Jul 2015 12:53:51 -0700  1      {kubelet kubernetes-node-tf0f}  implicitly required container POD  pulled     Pod container image "gcr.io/google_containers/pause:0.8.0" already present on machine
  Tue, 07 Jul 2015 12:53:51 -0700  Tue, 07 Jul 2015 12:53:51 -0700  1      {kubelet kubernetes-node-tf0f}  implicitly required container POD  created    Created with docker id 6a41280f516d
  Tue, 07 Jul 2015 12:53:51 -0700  Tue, 07 Jul 2015 12:53:51 -0700  1      {kubelet kubernetes-node-tf0f}  implicitly required container POD  started    Started with docker id 6a41280f516d
  Tue, 07 Jul 2015 12:53:51 -0700  Tue, 07 Jul 2015 12:53:51 -0700  1      {kubelet kubernetes-node-tf0f}  spec.containers{simmemleak}        created    Created with docker id 87348f12526a
```

The `Restart Count: 5` indicates that the `simmemleak` container in this pod was terminated and restarted five times.

You can call `kubectl get pod` with the `-o go-template=...` option to fetch the status of previously terminated containers:

```shell{% raw %}
[13:59:01] $ ./cluster/kubectl.sh get pod -o go-template='{{range.status.containerStatuses}}{{"Container Name: "}}{{.name}}{{"\r\nLastState: "}}{{.lastState}}{{end}}' simmemleak-60xbc
Container Name: simmemleak
LastState: map[terminated:map[exitCode:137 reason:OOM Killed startedAt:2015-07-07T20:58:43Z finishedAt:2015-07-07T20:58:43Z containerID:docker://0e4095bba1feccdfe7ef9fb6ebffe972b4b14285d5acdec6f0d3ae8a22fad8b2]]{% endraw %}
```

We can see that the container was terminated because of `reason:OOM Killed`, where *OOM* stands for Out Of Memory.

## Opaque Integer Resources (Alpha Feature)

Kubernetes version 1.5 introduces opaque integer resources, which allow cluster operators to advertise new node-level resources that would otherwise be unknown to the system.

Users can consume these resources in pod specs just like CPU and memory. The scheduler takes care of the resource accounting so that no more than the available amount is simultaneously allocated to pods.

**Note:** Opaque integer resources are Alpha in Kubernetes version 1.5. Only resource accounting is implemented; node-level isolation is still under active development.

Opaque integer resources are resources that begin with the prefix `pod.alpha.kubernetes.io/opaque-int-resource-`. The API server restricts quantities of these resources to whole numbers. Examples of _valid_ quantities are `3`, `3000m` and `3Ki`. Examples of _invalid_ quantities are `0.5` and `1500m`.

There are two steps required to use opaque integer resources. First, the cluster operator must advertise a per-node opaque resource on one or more nodes. Second, users must request the opaque resource in pods.

To advertise a new opaque integer resource, the cluster operator should submit a `PATCH` HTTP request to the API server to specify the available quantity in the `status.capacity` for a node in the cluster. After this operation, the node's `status.capacity` will include a new resource. The `status.allocatable` field is updated automatically with the new resource asynchronously by the kubelet. Note that because the scheduler uses the node's `status.allocatable` value when evaluating pod fitness, there may be a short delay between patching the node capacity with a new resource and the time when the first pod that requests the resource can be scheduled on that node.

**Example:**

The HTTP request below advertises 5 "foo" resources on node `k8s-node-1`.

_NOTE: `~1` is the encoding for the character `/` in the patch path. The operation path value in JSON-Patch is interpreted as a JSON-Pointer. For more details, please refer to [IETF RFC 6901, section 3](https://tools.ietf.org/html/rfc6901#section-3)._

```http
PATCH /api/v1/nodes/k8s-node-1/status HTTP/1.1
Accept: application/json
Content-Type: application/json-patch+json
Host: k8s-master:8080

[
  {
    "op": "add",
    "path": "/status/capacity/pod.alpha.kubernetes.io~1opaque-int-resource-foo",
    "value": "5"
  }
]
```

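If you prefer the command line, the same patch can be submitted with `curl` (a sketch, assuming the API server address shown above is reachable directly):

```shell
$ curl --header "Content-Type: application/json-patch+json" \
  --request PATCH \
  --data '[{"op": "add", "path": "/status/capacity/pod.alpha.kubernetes.io~1opaque-int-resource-foo", "value": "5"}]' \
  http://k8s-master:8080/api/v1/nodes/k8s-node-1/status
```
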
To consume opaque resources in pods, include the name of the opaque resource as a key in the `spec.containers[].resources.requests` map.

The pod will be scheduled only if all of the resource requests are satisfied (including CPU, memory, and any opaque resources). The pod will remain in the `Pending` state while the resource request cannot be met by any node.

**Example:**

The pod below requests 2 cpus and 1 "foo" (an opaque resource).

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: myimage
    resources:
      requests:
        cpu: 2
        pod.alpha.kubernetes.io/opaque-int-resource-foo: 1
```

## Planned Improvements

The current system only allows resource quantities to be specified on a container. It is planned to improve accounting for resources that are shared by all containers in a pod, such as [EmptyDir volumes](/docs/user-guide/volumes/#emptydir).

The current system only supports container requests and limits for CPU and memory. It is planned to add new resource types, including a node disk space resource, and a framework for adding custom [resource types](https://github.com/kubernetes/kubernetes/blob/{{page.githubbranch}}/docs/design/resources.md#resource-types).

Kubernetes supports overcommitment of resources by supporting multiple levels of [Quality of Service](http://issue.k8s.io/168).

Currently, one unit of CPU means different things on different cloud providers, and on different machine types within the same cloud provider. For example, on AWS, the capacity of a node is reported in [ECUs](http://aws.amazon.com/ec2/faqs/), while in GCE it is reported in logical cores. We plan to revise the definition of the cpu resource to allow for more consistency across providers and platforms.