---
layout: blog
title: 'Ephemeral volumes with storage capacity tracking: EmptyDir on steroids'
date: 2020-09-01
slug: ephemeral-volumes-with-storage-capacity-tracking
---
**Author:** Patrick Ohly (Intel)

Some applications need additional storage but don't care whether that
data is stored persistently across restarts. For example, caching
services are often limited by memory size and can move infrequently
used data into storage that is slower than memory with little impact
on overall performance. Other applications expect some read-only input
data to be present in files, like configuration data or secret keys.
Kubernetes already supports several kinds of such [ephemeral
volumes](/docs/concepts/storage/ephemeral-volumes), but the
functionality of those is limited to what is implemented inside
Kubernetes.
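For comparison, here is a minimal sketch of the most common of those, an `emptyDir` volume (all names are made up for illustration):
```yaml
# Sketch: a Pod with an in-tree emptyDir ephemeral volume.
# The volume is created when the Pod starts and deleted with the Pod.
kind: Pod
apiVersion: v1
metadata:
  name: my-cache-app        # hypothetical name
spec:
  containers:
    - name: cache
      image: busybox
      command: [ "sleep", "100000" ]
      volumeMounts:
        - mountPath: "/cache"
          name: scratch
  volumes:
    - name: scratch
      emptyDir:
        # Exceeding sizeLimit gets the Pod evicted; it is not a
        # hard limit enforced by the filesystem.
        sizeLimit: 1Gi
```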
[CSI ephemeral volumes](https://kubernetes.io/blog/2020/01/21/csi-ephemeral-inline-volumes/)
made it possible to extend Kubernetes with CSI
drivers that provide light-weight, local volumes. These [*inject
arbitrary states, such as configuration, secrets, identity, variables
or similar
information*](https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/20190122-csi-inline-volumes.md#motivation).
CSI drivers must be modified to support this Kubernetes feature,
i.e. normal, standard-compliant CSI drivers will not work. By design,
such volumes are supposed to be usable on whatever node
is chosen for a pod.
This is problematic for volumes which consume significant resources on
a node or for special storage that is only available on some nodes.
Therefore, Kubernetes 1.19 introduces two new alpha features for
volumes that are conceptually more like the `EmptyDir` volumes:
- [*generic* ephemeral volumes](/docs/concepts/storage/ephemeral-volumes#generic-ephemeral-volumes) and
- [CSI storage capacity tracking](/docs/concepts/storage/storage-capacity).
The advantages of the new approach are:
- Storage can be local or network-attached.
- Volumes can have a fixed size that applications are never able to exceed.
- Works with any CSI driver that supports provisioning of persistent
volumes and (for capacity tracking) implements the CSI `GetCapacity` call.
- Volumes may have some initial data, depending on the driver and
parameters.
- All of the typical volume operations (snapshotting,
resizing, the future storage capacity tracking, etc.)
are supported.
- The volumes are usable with any app controller that accepts
a Pod or volume specification.
- The Kubernetes scheduler itself picks suitable nodes, i.e. there is
no need anymore to implement and configure scheduler extenders and
mutating webhooks.
This makes generic ephemeral volumes a suitable solution for several
use cases:
# Use cases
## Persistent Memory as DRAM replacement for memcached
Recent releases of memcached added [support for using Persistent
Memory](https://memcached.org/blog/persistent-memory/) (PMEM) instead
of standard DRAM. When deploying memcached through one of the app
controllers, generic ephemeral volumes make it possible to request a PMEM volume
of a certain size from a CSI driver like
[PMEM-CSI](https://intel.github.io/pmem-csi/).
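A rough sketch of such a Pod, with hypothetical names and assuming memcached's `--memory-file` option as described in the linked blog post:
```yaml
# Hypothetical memcached Pod using a PMEM volume as DRAM replacement.
# Volume, class, and Pod names are assumptions for illustration.
kind: Pod
apiVersion: v1
metadata:
  name: memcached-pmem
spec:
  containers:
    - name: memcached
      image: memcached:1.6
      # -m sets the cache size in MB; --memory-file places the item
      # memory into a file on the PMEM volume (see the memcached
      # persistent-memory blog post linked above for details).
      command: [ "memcached", "-m", "3000", "--memory-file=/data/memory-file" ]
      volumeMounts:
        - mountPath: /data
          name: pmem-volume
  volumes:
    - name: pmem-volume
      ephemeral:
        volumeClaimTemplate:
          spec:
            accessModes: [ "ReadWriteOnce" ]
            resources:
              requests:
                storage: 4Gi
            storageClassName: pmem-csi-sc-late-binding  # assumption: a PMEM-CSI class
```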
## Local LVM storage as scratch space
Applications working with data sets that exceed the RAM size can
request local storage with performance characteristics or size that is
not met by the normal Kubernetes `EmptyDir` volumes. For example,
[TopoLVM](https://github.com/cybozu-go/topolvm) was written for that
purpose.
## Read-only access to volumes with data
Provisioning a volume might result in a non-empty volume:
- [restore a snapshot](/docs/concepts/storage/persistent-volumes/#volume-snapshot-and-restore-volume-from-snapshot-support)
- [cloning a volume](/docs/concepts/storage/volume-pvc-datasource)
- [generic data populators](https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/20200120-generic-data-populators.md)
Such volumes can be mounted read-only.
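For example, a generic ephemeral volume (explained in the next section) could be pre-populated from a volume snapshot and then mounted read-only; a sketch with hypothetical names, assuming a CSI driver with snapshot support:
```yaml
# Sketch: an ephemeral volume restored from a snapshot, mounted read-only.
# Snapshot, class, and Pod names are hypothetical.
kind: Pod
apiVersion: v1
metadata:
  name: my-readonly-app
spec:
  containers:
    - name: main
      image: busybox
      command: [ "sleep", "100000" ]
      volumeMounts:
        - mountPath: /input
          name: input-data
          readOnly: true
  volumes:
    - name: input-data
      ephemeral:
        volumeClaimTemplate:
          spec:
            accessModes: [ "ReadOnlyMany" ]
            # Pre-populate the volume from an existing snapshot.
            dataSource:
              apiGroup: snapshot.storage.k8s.io
              kind: VolumeSnapshot
              name: my-input-snapshot
            resources:
              requests:
                storage: 1Gi
            storageClassName: some-snapshot-capable-class
```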
# How it works
## Generic ephemeral volumes
The key idea behind generic ephemeral volumes is that a new volume
source, the so-called
[`EphemeralVolumeSource`](/docs/reference/generated/kubernetes-api/#ephemeralvolumesource-v1alpha1-core)
contains all fields that are needed to create a volume claim
(historically called persistent volume claim, PVC). A new controller
in the `kube-controller-manager` waits for Pods which embed such a
volume source and then creates a PVC for that pod. To a CSI driver
deployment, that PVC looks like any other, so no special support is
needed.
As long as these PVCs exist, they can be used like any other volume claim. In
particular, they can be referenced as data source in volume cloning or
snapshotting. The PVC object also holds the current status of the
volume.
Naming of the automatically created PVCs is deterministic: the name is
a combination of Pod name and volume name, with a hyphen (`-`) in the
middle. This deterministic naming makes it easier to
interact with the PVC because one does not have to search for it once
the Pod name and volume name are known. The downside is that the name might
already be in use. Kubernetes detects such a conflict and then blocks Pod
startup.
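For example (hypothetical names), a Pod called `my-app` with an ephemeral volume called `scratch` gets a PVC called `my-app-scratch`:
```console
kubectl get pvc/my-app-scratch
```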
To ensure that the volume gets deleted together with the pod, the
controller makes the Pod the owner of the volume claim. When the Pod
gets deleted, the normal garbage-collection mechanism also removes the
claim and thus the volume.
Claims select the storage driver through the normal storage class
mechanism. Although storage classes with both immediate and late
binding (aka `WaitForFirstConsumer`) are supported, for ephemeral
volumes it makes more sense to use `WaitForFirstConsumer`: then Pod
scheduling can take into account both node utilization and
availability of storage when choosing a node (see the storage class
sketch below).
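Such a storage class might look roughly like the following sketch, loosely modeled on the `pmem-csi-sc-late-binding` class used in the example later in this post (provisioner and parameters depend on the actual driver deployment):
```yaml
# Sketch of a storage class with late binding.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: pmem-csi-sc-late-binding
provisioner: pmem-csi.intel.com
# Volume creation is delayed until a Pod using the claim is
# scheduled, so the scheduler gets to pick the node.
volumeBindingMode: WaitForFirstConsumer
```
Making that scheduling decision aware of the available storage is where the other new feature comes in.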
## Storage capacity tracking
Normally, the Kubernetes scheduler has no information about where a
CSI driver might be able to create a volume. It also has no way of
talking directly to a CSI driver to retrieve that information. It
therefore tries different nodes until it finds one where all volumes
can be made available (late binding) or leaves it entirely to the
driver to choose a location (immediate binding).
The new [`CSIStorageCapacity` alpha
API](/docs/reference/generated/kubernetes-api/v1.19/#csistoragecapacity-v1alpha1-storage-k8s-io)
allows storing the necessary information in etcd where it is available to the
scheduler. In contrast to support for generic ephemeral volumes,
storage capacity tracking must be [enabled when deploying a CSI
driver](https://github.com/kubernetes-csi/external-provisioner/blob/master/README.md#capacity-support):
the `external-provisioner` must be told to publish capacity
information that it then retrieves from the CSI driver through the normal
`GetCapacity` call.
<!-- TODO: update the link with a revision once https://github.com/kubernetes-csi/external-provisioner/pull/450 is merged -->
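As a rough sketch (the authoritative flag names are in the README linked above and may still change while the feature is alpha), the relevant part of the external-provisioner container spec could look like this:
```yaml
# Assumption: flag name and value taken from the external-provisioner
# README linked above; check it for the authoritative syntax.
args:
  - --enable-capacity=central
env:
  # The provisioner needs its own namespace and Pod name to set
  # owner references on the CSIStorageCapacity objects it creates.
  - name: NAMESPACE
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace
  - name: POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
```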
When the Kubernetes scheduler needs to choose a node for a Pod with an
unbound volume that uses late binding and the CSI driver deployment
has opted into the feature by setting the [`CSIDriver.storageCapacity`
flag](/docs/reference/generated/kubernetes-api/v1.19/#csidriver-v1beta1-storage-k8s-io),
the scheduler automatically filters out nodes that do not have
access to enough storage capacity. This works for generic ephemeral
and persistent volumes but *not* for CSI ephemeral volumes because the
parameters of those are opaque for Kubernetes.
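The opt-in itself is a single field in the driver's CSIDriver object; a minimal sketch:
```yaml
apiVersion: storage.k8s.io/v1beta1
kind: CSIDriver
metadata:
  name: pmem-csi.intel.com
spec:
  # Tell the scheduler to consider CSIStorageCapacity objects
  # published for this driver.
  storageCapacity: true
```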
As usual, volumes with immediate binding get created before scheduling
pods, with their location chosen by the storage driver. Therefore, the
external-provisioner's default configuration skips storage
classes with immediate binding as the information wouldn't be used anyway.
Because the Kubernetes scheduler must act on potentially outdated
information, there is no guarantee that the capacity is still available
when a volume is to be created. Still, the chances that it can be created
without retries should be higher.
# Security
## CSIStorageCapacity
CSIStorageCapacity objects are namespaced. When each CSI
driver is deployed in its own namespace and, as recommended, the RBAC
permissions for CSIStorageCapacity are limited to that namespace, it is
always obvious where the data came from. However, Kubernetes does
not check that and typically drivers get installed in the same
namespace anyway, so ultimately drivers are *expected to behave* and
not publish incorrect data.
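A namespaced Role granting only those permissions could look roughly like this (names are hypothetical):
```yaml
# Sketch: restrict CSIStorageCapacity access to the driver's namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: external-provisioner-capacity   # hypothetical name
  namespace: pmem-csi                   # hypothetical driver namespace
rules:
  - apiGroups: ["storage.k8s.io"]
    resources: ["csistoragecapacities"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
```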
## Generic ephemeral volumes
If users have permission to create a Pod (directly or indirectly),
then they can also create generic ephemeral volumes even when they do
not have permission to create a volume claim. That's because RBAC
permission checks are applied to the controller which creates the
PVC, not the original user. This is a fundamental change that must be
[taken into
account](/docs/concepts/storage/ephemeral-volumes#security) before
enabling the feature in clusters where untrusted users are not
supposed to have permission to create volumes.
# Example
A [special branch](https://github.com/intel/pmem-csi/commits/kubernetes-1-19-blog-post)
in PMEM-CSI contains all the necessary changes to bring up a
Kubernetes 1.19 cluster inside QEMU VMs with both alpha features
enabled. The PMEM-CSI driver code is used unchanged; only the
deployment was updated.
On a suitable machine (Linux, with a non-root user that can use Docker; see the
[QEMU and
Kubernetes](https://intel.github.io/pmem-csi/0.7/docs/autotest.html#qemu-and-kubernetes)
section in the PMEM-CSI documentation), the following commands bring
up a cluster and install the PMEM-CSI driver:
```console
git clone --branch=kubernetes-1-19-blog-post https://github.com/intel/pmem-csi.git
cd pmem-csi
export TEST_KUBERNETES_VERSION=1.19 TEST_FEATURE_GATES=CSIStorageCapacity=true,GenericEphemeralVolume=true TEST_PMEM_REGISTRY=intel
make start && echo && test/setup-deployment.sh
```
If all goes well, the output contains the following usage
instructions:
```
The test cluster is ready. Log in with [...]/pmem-csi/_work/pmem-govm/ssh.0, run
kubectl once logged in. Alternatively, use kubectl directly with the
following env variable:
KUBECONFIG=[...]/pmem-csi/_work/pmem-govm/kube.config
secret/pmem-csi-registry-secrets created
secret/pmem-csi-node-secrets created
serviceaccount/pmem-csi-controller created
...
To try out the pmem-csi driver ephemeral volumes:
cat deploy/kubernetes-1.19/pmem-app-ephemeral.yaml |
[...]/pmem-csi/_work/pmem-govm/ssh.0 kubectl create -f -
```
The CSIStorageCapacity objects are not meant to be human-readable, so
some post-processing is needed. The following Go template filters
all objects by the storage class that the example uses and prints the
name, topology and capacity:
```console
kubectl get \
        -o go-template='{{range .items}}{{if eq .storageClassName "pmem-csi-sc-late-binding"}}{{.metadata.name}} {{.nodeTopology.matchLabels}} {{.capacity}}
{{end}}{{end}}' \
        csistoragecapacities
```
```
csisc-2js6n map[pmem-csi.intel.com/node:pmem-csi-pmem-govm-worker2] 30716Mi
csisc-sqdnt map[pmem-csi.intel.com/node:pmem-csi-pmem-govm-worker1] 30716Mi
csisc-ws4bv map[pmem-csi.intel.com/node:pmem-csi-pmem-govm-worker3] 30716Mi
```
One individual object has the following content:
```console
kubectl describe csistoragecapacities/csisc-sqdnt
```
```
Name:              csisc-sqdnt
Namespace:         default
Labels:            <none>
Annotations:       <none>
API Version:       storage.k8s.io/v1alpha1
Capacity:          30716Mi
Kind:              CSIStorageCapacity
Metadata:
  Creation Timestamp:  2020-08-11T15:41:03Z
  Generate Name:       csisc-
  Managed Fields:
    ...
  Owner References:
    API Version:     apps/v1
    Controller:      true
    Kind:            StatefulSet
    Name:            pmem-csi-controller
    UID:             590237f9-1eb4-4208-b37b-5f7eab4597d1
  Resource Version:  2994
  Self Link:         /apis/storage.k8s.io/v1alpha1/namespaces/default/csistoragecapacities/csisc-sqdnt
  UID:               da36215b-3b9d-404a-a4c7-3f1c3502ab13
Node Topology:
  Match Labels:
    pmem-csi.intel.com/node:  pmem-csi-pmem-govm-worker1
Storage Class Name:           pmem-csi-sc-late-binding
Events:                       <none>
```
Now let's create the example app with one generic ephemeral
volume. The `pmem-app-ephemeral.yaml` file contains:
```yaml
# This example Pod definition demonstrates
# how to use generic ephemeral inline volumes
# with a PMEM-CSI storage class.
kind: Pod
apiVersion: v1
metadata:
  name: my-csi-app-inline-volume
spec:
  containers:
    - name: my-frontend
      image: intel/pmem-csi-driver-test:v0.7.14
      command: [ "sleep", "100000" ]
      volumeMounts:
        - mountPath: "/data"
          name: my-csi-volume
  volumes:
    - name: my-csi-volume
      ephemeral:
        volumeClaimTemplate:
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 4Gi
            storageClassName: pmem-csi-sc-late-binding
```
After creating that as shown in the usage instructions above, we have one additional Pod and PVC:
```console
kubectl get pods/my-csi-app-inline-volume -o wide
```
```
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
my-csi-app-inline-volume 1/1 Running 0 6m58s 10.36.0.2 pmem-csi-pmem-govm-worker1 <none> <none>
```
```console
kubectl get pvc/my-csi-app-inline-volume-my-csi-volume
```
```
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
my-csi-app-inline-volume-my-csi-volume Bound pvc-c11eb7ab-a4fa-46fe-b515-b366be908823 4Gi RWO pmem-csi-sc-late-binding 9m21s
```
That PVC is owned by the Pod:
```console
kubectl get -o yaml pvc/my-csi-app-inline-volume-my-csi-volume
```
```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: pmem-csi.intel.com
    volume.kubernetes.io/selected-node: pmem-csi-pmem-govm-worker1
  creationTimestamp: "2020-08-11T15:44:57Z"
  finalizers:
  - kubernetes.io/pvc-protection
  managedFields:
  ...
  name: my-csi-app-inline-volume-my-csi-volume
  namespace: default
  ownerReferences:
  - apiVersion: v1
    blockOwnerDeletion: true
    controller: true
    kind: Pod
    name: my-csi-app-inline-volume
    uid: 75c925bf-ca8e-441a-ac67-f190b7a2265f
...
```
Eventually, the storage capacity information for `pmem-csi-pmem-govm-worker1` also gets updated:
```
csisc-2js6n map[pmem-csi.intel.com/node:pmem-csi-pmem-govm-worker2] 30716Mi
csisc-sqdnt map[pmem-csi.intel.com/node:pmem-csi-pmem-govm-worker1] 26620Mi
csisc-ws4bv map[pmem-csi.intel.com/node:pmem-csi-pmem-govm-worker3] 30716Mi
```
The 4Gi (= 4096Mi) requested by the new volume were deducted from the
previously reported 30716Mi. If another app needs more than the remaining
26620Mi, the Kubernetes scheduler will not pick `pmem-csi-pmem-govm-worker1` anymore.
# Next steps
Both features are under development. Several open questions were
already raised during the alpha review process. The two enhancement
proposals document the work that will be needed for migration to beta and what
alternatives were already considered and rejected:
* [KEP-1698: generic ephemeral inline
volumes](https://github.com/kubernetes/enhancements/blob/9d7a75d/keps/sig-storage/1698-generic-ephemeral-volumes/README.md)
* [KEP-1472: Storage Capacity
Tracking](https://github.com/kubernetes/enhancements/tree/9d7a75d/keps/sig-storage/1472-storage-capacity-tracking)
Your feedback is crucial for driving that development. SIG-Storage
[meets
regularly](https://github.com/kubernetes/community/tree/master/sig-storage#meetings)
and can be reached via [Slack and a mailing
list](https://github.com/kubernetes/community/tree/master/sig-storage#contact).