---
layout: blog
title: 'Kubernetes 1.14: Local Persistent Volumes GA'
date: 2019-04-04
---

**Authors**: Michelle Au (Google), Matt Schallert (Uber), Celina Ward (Uber)

The [Local Persistent Volumes](https://kubernetes.io/docs/concepts/storage/volumes/#local)
feature has been promoted to GA in Kubernetes 1.14.
It was first introduced as alpha in Kubernetes 1.7, and then
[beta](https://kubernetes.io/blog/2018/04/13/local-persistent-volumes-beta/) in Kubernetes
1.10. The GA milestone indicates that Kubernetes users may depend on the feature
and its API for production use. GA features are protected by the Kubernetes
[deprecation
policy](https://kubernetes.io/docs/reference/using-api/deprecation-policy/).

## What is a Local Persistent Volume?

A local persistent volume represents a local disk directly attached to a single
Kubernetes node.

Kubernetes provides a powerful volume plugin system that enables Kubernetes
workloads to use a [wide
variety](https://kubernetes.io/docs/concepts/storage/volumes/#types-of-volumes)
of block and file storage to persist data. Most
of these plugins enable remote storage -- these remote storage systems persist
data independent of the Kubernetes node where the data originated. Remote
storage usually cannot offer the consistent high performance guarantees of
local directly-attached storage. With the Local Persistent Volume plugin,
Kubernetes workloads can now consume high performance local storage using the
same volume APIs that app developers have become accustomed to.

## How is it different from a HostPath Volume?

To better understand the benefits of a Local Persistent Volume, it is useful to
compare it to a [HostPath volume](https://kubernetes.io/docs/concepts/storage/volumes/#hostpath).
HostPath volumes mount a file or directory from
the host node’s filesystem into a Pod. Similarly, a Local Persistent Volume
mounts a local disk or partition into a Pod.

The biggest difference is that the Kubernetes scheduler understands which node a
Local Persistent Volume belongs to. With HostPath volumes, a pod referencing a
HostPath volume may be moved by the scheduler to a different node, resulting in
data loss. But with Local Persistent Volumes, the Kubernetes scheduler ensures
that a pod using a Local Persistent Volume is always scheduled to the same node.

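That node association is recorded on the PersistentVolume object itself as required node affinity. As a rough sketch (the disk path and node name below are placeholders), a directory-backed local PV looks like this:

```
apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-local-pv
spec:
  capacity:
    storage: 368Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-storage
  local:
    path: /mnt/disks/ssd1        # a disk mounted on the node; placeholder path
  nodeAffinity:                  # pins this volume, and any pod using it, to one node
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - my-node-1            # placeholder node name
```

Any pod whose claim binds to this volume can then only be scheduled onto `my-node-1`.
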
While HostPath volumes may be referenced via a Persistent Volume Claim (PVC) or
directly inline in a pod definition, Local Persistent Volumes can only be
referenced via a PVC. This provides additional security benefits since
Persistent Volume objects are managed by the administrator, preventing Pods from
being able to access any path on the host.

Additional benefits include support for formatting of block devices during
mount, and volume ownership using fsGroup.

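As a brief sketch of the latter (the group ID and claim name here are arbitrary), a pod can set `fsGroup` in its security context so that files on the mounted volume are accessible to that group:

```
apiVersion: v1
kind: Pod
metadata:
  name: fsgroup-example
spec:
  securityContext:
    fsGroup: 2000                # volume contents are made group-accessible to GID 2000 at mount time
  containers:
  - name: app
    image: registry.k8s.io/busybox
    command: ["sh", "-c", "sleep 100000"]
    volumeMounts:
    - name: local-vol
      mountPath: /data
  volumes:
  - name: local-vol
    persistentVolumeClaim:
      claimName: example-local-claim   # hypothetical claim bound to a local PV
```
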
## What's New With GA?

Since 1.10, we have mainly focused on improving stability and scalability of the
feature so that it is production ready.

The only major feature addition is the ability to specify a raw block device and
have Kubernetes automatically format and mount the filesystem. This reduces the
previous burden of having to format and mount devices before handing them to
Kubernetes.

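Concretely, this means the `local` source of a PV (see the sketch in the HostPath comparison above) can point at an unformatted block device instead of a pre-mounted directory; the device path and filesystem type below are illustrative assumptions:

```
  local:
    path: /dev/nvme0n1   # a raw block device rather than a mounted directory; placeholder name
    fsType: ext4         # the filesystem Kubernetes creates and mounts on first use
```
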
## Limitations of GA

At GA, Local Persistent Volumes do not support [dynamic volume
provisioning](https://kubernetes.io/docs/concepts/storage/dynamic-provisioning/).
However, there is an [external
controller](https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner)
available to help manage the local
PersistentVolume lifecycle for individual disks on your nodes. This includes
creating the PersistentVolume objects, and cleaning up and reusing disks once they
have been released by the application.

## How to Use a Local Persistent Volume?

Workloads can request a local persistent volume using the same
PersistentVolumeClaim interface as remote storage backends. This makes it easy
to swap out the storage backend across clusters, clouds, and on-prem
environments.

First, a StorageClass should be created that sets `volumeBindingMode:
WaitForFirstConsumer` to enable [volume topology-aware
scheduling](https://kubernetes.io/docs/concepts/storage/storage-classes/#volume-binding-mode).
This mode instructs Kubernetes to wait to bind a PVC until a Pod using it is scheduled.

```
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
```

Then, the external static provisioner can be [configured and
run](https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner#user-guide) to create PVs
for all the local disks on your nodes.

```
$ kubectl get pv
NAME                CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS    REASON   AGE
local-pv-27c0f084   368Gi      RWO            Delete           Available           local-storage            8s
local-pv-3796b049   368Gi      RWO            Delete           Available           local-storage            7s
local-pv-3ddecaea   368Gi      RWO            Delete           Available           local-storage            7s
```

Afterwards, workloads can start using the PVs by creating a PVC and Pod or a
StatefulSet with volumeClaimTemplates.

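For the first option, a standalone claim is just an ordinary PVC that references the `local-storage` class; a minimal sketch, with an arbitrary claim name, might look like this:

```
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: example-local-claim
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: local-storage
  resources:
    requests:
      storage: 368Gi
```

The StatefulSet below takes the second approach and generates an equivalent claim for each replica via `volumeClaimTemplates`.
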
```
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: local-test
spec:
  serviceName: "local-service"
  replicas: 3
  selector:
    matchLabels:
      app: local-test
  template:
    metadata:
      labels:
        app: local-test
    spec:
      containers:
      - name: test-container
        image: registry.k8s.io/busybox # updated after publication (previously used k8s.gcr.io/busybox)
        command:
        - "/bin/sh"
        args:
        - "-c"
        - "sleep 100000"
        volumeMounts:
        - name: local-vol
          mountPath: /usr/test-pod
  volumeClaimTemplates:
  - metadata:
      name: local-vol
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "local-storage"
      resources:
        requests:
          storage: 368Gi
```

Once the StatefulSet is up and running, the PVCs are all bound:

```
$ kubectl get pvc
NAME                     STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS    AGE
local-vol-local-test-0   Bound    local-pv-27c0f084   368Gi      RWO            local-storage   3m45s
local-vol-local-test-1   Bound    local-pv-3ddecaea   368Gi      RWO            local-storage   3m40s
local-vol-local-test-2   Bound    local-pv-3796b049   368Gi      RWO            local-storage   3m36s
```

When the disk is no longer needed, the PVC can be deleted. The external static provisioner
will clean up the disk and make the PV available for use again.

```
$ kubectl patch sts local-test -p '{"spec":{"replicas":2}}'
statefulset.apps/local-test patched

$ kubectl delete pvc local-vol-local-test-2
persistentvolumeclaim "local-vol-local-test-2" deleted

$ kubectl get pv
NAME                CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM                            STORAGECLASS    REASON   AGE
local-pv-27c0f084   368Gi      RWO            Delete           Bound       default/local-vol-local-test-0   local-storage            11m
local-pv-3796b049   368Gi      RWO            Delete           Available                                    local-storage            7s
local-pv-3ddecaea   368Gi      RWO            Delete           Bound       default/local-vol-local-test-1   local-storage            19m
```

You can find full [documentation](https://kubernetes.io/docs/concepts/storage/volumes/#local)
for the feature on the Kubernetes website.

## What Are Suitable Use Cases?

The primary benefit of Local Persistent Volumes over remote persistent storage
is performance: local disks usually offer higher IOPS and throughput and lower
latency compared to remote storage systems.

However, there are important limitations and caveats to consider when using
Local Persistent Volumes:

* Using local storage ties your application to a specific node, making your
application harder to schedule. Applications which use local storage should
specify a high priority so that lower-priority pods that don’t require local
storage can be preempted if necessary; a sketch of such a priority class
follows this list.
* If that node or local volume encounters a failure and becomes inaccessible, then
that pod also becomes inaccessible. Manual intervention, external controllers,
or operators may be needed to recover from these situations.
* While most remote storage systems implement synchronous replication, most local
disk offerings do not provide data durability guarantees, meaning that loss of the
disk or node may result in loss of all the data on that disk.

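For the first point above, a priority class could look roughly like the following; the name and value are arbitrary choices, not Kubernetes defaults:

```
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: local-storage-critical    # hypothetical name
value: 1000000                    # higher value wins during preemption
globalDefault: false
description: "Priority for workloads pinned to local persistent volumes."
```

Pods that use local storage would then set `priorityClassName: local-storage-critical` in their spec.
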
For these reasons, local persistent storage should only be considered for
workloads that handle data replication and backup at the application layer, thus
making the applications resilient to node or data failures and unavailability
despite the lack of such guarantees at the individual disk level.

Examples of good workloads include software defined storage systems and
replicated databases. Other types of applications should continue to use highly
available, remotely accessible, durable storage.

## How Uber Uses Local Storage

[M3](https://eng.uber.com/m3/), Uber’s in-house metrics platform,
piloted Local Persistent Volumes at scale
in an effort to evaluate [M3DB](https://m3db.io/) —
an open-source, distributed timeseries database
created by Uber. One of M3DB’s notable features is its ability to shard its
metrics into partitions, replicate them by a factor of three, and then evenly
disperse the replicas across separate failure domains.

Prior to the pilot with local persistent volumes, M3DB ran exclusively in
Uber-managed environments. Over time, internal use cases arose that required the
ability to run M3DB in environments with fewer dependencies. So the team began
to explore options. As an open-source project, we wanted to provide the
community with a way to run M3DB as easily as possible, with an open-source
stack, while meeting M3DB’s requirements for high throughput, low-latency
storage, and the ability to scale itself out.

The Kubernetes Local Persistent Volume interface, with its high-performance,
low-latency guarantees, quickly emerged as the perfect abstraction to build on
top of. With Local Persistent Volumes, individual M3DB instances can comfortably
handle up to 600k writes per second. This leaves plenty of headroom for spikes
on clusters that typically process a few million metrics per second.

Because M3DB also gracefully handles losing a single node or volume, the limited
data durability guarantees of Local Persistent Volumes are not an issue. If a
node fails, M3DB finds a suitable replacement and the new node begins streaming
data from its two peers.

Thanks to the Kubernetes scheduler’s intelligent handling of volume topology,
M3DB is able to programmatically disperse its replicas evenly across multiple
local persistent volumes in all available cloud zones, or, in the case of
on-prem clusters, across all available server racks.

## Uber's Operational Experience

As mentioned above, while Local Persistent Volumes provide many benefits, they
also require careful planning and consideration of constraints before
committing to them in production. When thinking about our local volume strategy
for M3DB, there were a few things Uber had to consider.

For one, we had to take into account the hardware profiles of the nodes in our
Kubernetes cluster. For example, how many local disks would each node in the
cluster have? How would they be partitioned?

The local static provisioner provides
[guidance](https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner/blob/master/docs/best-practices.md)
to help answer these questions. It’s best to be able to dedicate a full disk to each local volume
(for IO isolation) and a full partition per-volume (for capacity isolation).
This was easier in our cloud environments where we could mix and match local
disks. However, if using local volumes on-prem, hardware constraints may be a
limiting factor depending on the number of disks available and their
characteristics.

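On each node, that layout amounts to formatting every dedicated disk and mounting it under the discovery directory the static provisioner watches. A rough sketch, assuming a dedicated disk at `/dev/sdb` and a discovery directory of `/mnt/disks` (both placeholders that must match your provisioner configuration):

```
# Format the dedicated disk and mount it under the provisioner's discovery directory.
$ sudo mkfs.ext4 /dev/sdb
$ DISK_UUID=$(sudo blkid -s UUID -o value /dev/sdb)
$ sudo mkdir -p /mnt/disks/$DISK_UUID
$ sudo mount -t ext4 /dev/sdb /mnt/disks/$DISK_UUID

# Persist the mount across reboots.
$ echo "UUID=$DISK_UUID /mnt/disks/$DISK_UUID ext4 defaults 0 2" | sudo tee -a /etc/fstab
```
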
When first testing local volumes, we wanted to have a thorough understanding of
the effect
[disruptions](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/)
(voluntary and involuntary) would have on pods using
local storage, and so we began testing some failure scenarios. We found that
when a local volume becomes unavailable while the node remains available (such
as when performing maintenance on the disk), a pod using the local volume will
be stuck in a ContainerCreating state until it can mount the volume. If a node
becomes unavailable, for example if it is removed from the cluster or is
[drained](https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/),
then pods using local volumes on that node are stuck in an Unknown or
Pending state depending on whether or not the node was removed gracefully.

Recovering pods from these interim states means having to delete the PVC binding
the pod to its local volume and then delete the pod in order for it to be
rescheduled (or wait until the node and disk are available again). We took this
into account when building our [operator](https://github.com/m3db/m3db-operator)
for M3DB, which makes changes to the
cluster topology when a pod is rescheduled such that the new one gracefully
streams data from the remaining two peers. Eventually we plan to automate the
deletion and rescheduling process entirely.

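The manual version of that recovery is roughly the following, using the names from the StatefulSet example above; the forced deletion is only appropriate when the node is known to be gone and should be used with care:

```
# Release the claim that pins the pod to the lost volume, then delete the pod so it can be rescheduled.
$ kubectl delete pvc local-vol-local-test-2
$ kubectl delete pod local-test-2 --force --grace-period=0
```
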
Alerts on pod states can help call attention to stuck local volumes, and
workload-specific controllers or operators can remediate them automatically.
Because of these constraints, it’s best to exclude nodes with local volumes from
automatic upgrades or repairs, and in fact some cloud providers explicitly
mention this as a best practice.

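In the absence of dedicated alerting, a quick manual check for pods stuck waiting to schedule or mount might look like this (a sketch; a real setup would alert on the same signal):

```
$ kubectl get pods --all-namespaces --field-selector=status.phase=Pending
```
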
## Portability Between On-Prem and Cloud

Local Volumes played a big role in Uber’s decision to build orchestration for
M3DB using Kubernetes, in part because it is a storage abstraction that works
the same across on-prem and cloud environments. Remote storage solutions have
different characteristics across cloud providers, and some users may prefer not
to use networked storage at all in their own data centers. On the other hand,
local disks are relatively ubiquitous and provide more predictable performance
characteristics.

By orchestrating M3DB using local disks in the cloud, where it was easier to get
up and running with Kubernetes, we gained confidence that we could still use our
operator to run M3DB in our on-prem environment without any modifications. As we
continue to work on how we’d run Kubernetes on-prem, having solved such an
important pending question is a big relief.

## What's Next for Local Persistent Volumes?

As we’ve seen with Uber’s M3DB, local persistent volumes have successfully been
used in production environments. As adoption of local persistent volumes
continues to increase, SIG Storage continues to seek feedback for ways to
improve the feature.

One of the most frequent asks has been for a controller that can help with
recovery from failed nodes or disks, which is currently a manual process (or
something that has to be built into an operator). SIG Storage is investigating
creating a common controller that can be used by workloads with simple and
similar recovery processes.

Another popular ask has been to support dynamic provisioning using LVM. This can
simplify disk management and improve disk utilization. SIG Storage is
evaluating the performance tradeoffs for the viability of this feature.

## Getting Involved

If you have feedback for this feature or are interested in getting involved with
the design and development, join the [Kubernetes Storage
Special-Interest-Group](https://github.com/kubernetes/community/blob/master/sig-storage/README.md)
(SIG). We’re rapidly growing and always welcome new contributors.

Special thanks to all the contributors that helped bring this feature to GA,
including Chuqiang Li (lichuqiang), Dhiraj Hedge (dhirajh), Ian Chakeres
(ianchakeres), Jan Šafránek (jsafrane), Michelle Au (msau42), Saad Ali
(saad-ali), Yecheng Fu (cofyc) and Yuquan Ren (nickrenren).