Add seccomp default feature blog post
This adds the blog post about the new Kubernetes `SeccompDefault` alpha feature.

Signed-off-by: Sascha Grunert <sgrunert@redhat.com>

pull/28951/head
parent
86aa6c434d
commit
84e472e95c
@ -0,0 +1,267 @@
---
layout: blog
title: "Enable seccomp for all workloads with a new v1.22 alpha feature"
date: 2021-08-25
slug: seccomp-default
---

**Author:** Sascha Grunert, Red Hat

This blog post is about a new Kubernetes feature introduced in v1.22, which adds
an additional security layer on top of the existing seccomp support. Seccomp is
a security mechanism for Linux processes to filter system calls (syscalls) based
on a set of defined rules. Applying seccomp profiles to containerized workloads
is one of the key tasks when it comes to enhancing the security of the
application deployment. Developers, site reliability engineers and
infrastructure administrators have to work hand in hand to create, distribute
and maintain the profiles over the application's life cycle.

You can use the [`securityContext`][seccontext] field of Pods and their
containers to adjust the security-related configuration of the
workload. Kubernetes introduced dedicated [seccomp related API
fields][seccontext] in this `SecurityContext` with the [graduation of seccomp to
General Availability (GA)][ga] in v1.19.0. This enhancement allowed an easier
way to specify if the whole pod or a specific container should run as:

[seccontext]: /docs/reference/kubernetes-api/workload-resources/pod-v1/#security-context-1
[ga]: https://kubernetes.io/blog/2020/08/26/kubernetes-release-1.19-accentuate-the-paw-sitive/#graduated-to-stable

- `Unconfined`: seccomp will not be enabled
- `RuntimeDefault`: the container runtime's default profile will be used
- `Localhost`: a node-local profile will be applied, which is referenced
  by a path relative to the seccomp profile root (`<kubelet-root-dir>/seccomp`)
  of the kubelet

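As a minimal sketch of how these values appear in a manifest, the following
script writes a Pod definition that sets `RuntimeDefault` at the pod level and
a `Localhost` profile for one container. The pod name, container name and the
profile path `profiles/custom.json` are hypothetical placeholders and not part
of the original post:

```bash
# Sketch: write a Pod manifest that sets seccomp profiles explicitly.
# Pod name, container name and profile path are hypothetical placeholders.
set -euo pipefail

manifest="$(mktemp)"

cat > "$manifest" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: explicit-seccomp-pod
spec:
  securityContext:
    # Pod-level setting, used by containers that do not set their own.
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: custom-profile-container
    image: nginx:1.21
    securityContext:
      # Container-level setting, resolved relative to
      # <kubelet-root-dir>/seccomp on the node.
      seccompProfile:
        type: Localhost
        localhostProfile: profiles/custom.json
EOF

echo "wrote manifest to $manifest"
```

On a cluster, the resulting file could then be applied with
`kubectl apply -f "$manifest"`, assuming the referenced profile exists on the
node.
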
With the graduation of seccomp, nothing has changed from an overall security
perspective, because `Unconfined` is still the default. This is totally fine if
you consider this from the upgrade path and backwards compatibility perspective of
Kubernetes releases. But it also means that it is more likely that a workload
runs without seccomp at all, which should be fixed in the long term.

## `SeccompDefault` to the rescue

Kubernetes v1.22.0 introduces a new kubelet [feature gate][gate]
`SeccompDefault`, which has been added in `alpha` state like every other new
feature. This means that it is disabled by default and can be enabled manually
for every single Kubernetes node.

[gate]: /docs/reference/command-line-tools-reference/feature-gates

What does the feature do? Well, it just changes the default seccomp profile from
`Unconfined` to `RuntimeDefault`. If not specified differently in the pod
manifest, then the feature will add a higher set of security constraints by
using the default profile of the container runtime. These profiles may differ
between runtimes like [CRI-O][crio] or [containerd][ctrd]. They also differ
between hardware architectures. But generally speaking, those default profiles
allow a common set of syscalls while blocking the more dangerous ones, which
are unlikely or unsafe to be used in a containerized application.

[crio]: https://github.com/cri-o/cri-o/blob/fe30d62/vendor/github.com/containers/common/pkg/seccomp/default_linux.go#L45
[ctrd]: https://github.com/containerd/containerd/blob/e1445df/contrib/seccomp/seccomp_default.go#L51

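The resulting behavior can be modeled with a small shell function. This is only
an illustrative sketch of the rules described in this post, not the kubelet's
actual code; the function name `effective_profile` and its arguments are
hypothetical:

```bash
# effective_profile CONTAINER_TYPE POD_TYPE SECCOMP_DEFAULT
# Prints the seccomp profile type that would apply to a container.
# An empty string means "not set in the manifest".
effective_profile() {
  local container_type="$1" pod_type="$2" seccomp_default="$3"

  if [ -n "$container_type" ]; then
    # A container-level setting takes precedence over the pod level.
    echo "$container_type"
  elif [ -n "$pod_type" ]; then
    echo "$pod_type"
  elif [ "$seccomp_default" = "true" ]; then
    # Nothing specified and the feature is on: use the runtime default.
    echo "RuntimeDefault"
  else
    # Nothing specified and the feature is off: no seccomp filtering.
    echo "Unconfined"
  fi
}

effective_profile "" "" false            # prints Unconfined
effective_profile "" "" true             # prints RuntimeDefault
effective_profile "Unconfined" "" true   # prints Unconfined
effective_profile "" "Localhost" true    # prints Localhost
```

The third call shows that an explicit `Unconfined` in the manifest still wins
over the new default, so workloads can opt out individually.
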
### Enabling the feature

Two kubelet configuration changes have to be made to enable the feature:

1. **Enable the feature gate** by setting `SeccompDefault=true` via the command
   line (`--feature-gates`) or the [kubelet configuration][kubelet] file.
2. **Turn on the feature** by adding the
   `--seccomp-default` command line flag or via the [kubelet
   configuration][kubelet] file (`seccompDefault: true`).

[kubelet]: /docs/tasks/administer-cluster/kubelet-config-file

The kubelet will error on startup if only one of the above steps has been done.

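A minimal sketch of the configuration file variant could look like the
following. It assumes the `kubelet.config.k8s.io/v1beta1` `KubeletConfiguration`
format and writes to a temporary file rather than the real kubelet
configuration path. The helper `check_seccomp_default` is hypothetical and only
greps for both settings; it does not replace the kubelet's own validation:

```bash
# Sketch: write a kubelet configuration with both settings and verify that
# neither one is missing. The file location is a temporary placeholder.
set -euo pipefail

config="$(mktemp)"

cat > "$config" <<'EOF'
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  SeccompDefault: true   # step 1: enable the feature gate
seccompDefault: true     # step 2: turn on the feature
EOF

# check_seccomp_default FILE
# Succeeds only if both the feature gate and the setting are present.
check_seccomp_default() {
  grep -Eq '^[[:space:]]+SeccompDefault:[[:space:]]*true' "$1" &&
    grep -Eq '^seccompDefault:[[:space:]]*true' "$1"
}

if check_seccomp_default "$config"; then
  echo "both settings present"
else
  echo "incomplete configuration" >&2
fi
```

The command line equivalent of the two settings would be
`--feature-gates=SeccompDefault=true --seccomp-default`.
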
### Trying it out

If the feature is enabled on a node, then you can create a new workload like
this:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
  - name: test-container
    image: nginx:1.21
```

Now it is possible to inspect the used seccomp profile by using
[`crictl`][crictl] while investigating the container's [runtime
specification][rspec]:

[crictl]: https://github.com/kubernetes-sigs/cri-tools
[rspec]: https://github.com/opencontainers/runtime-spec/blob/0c021c1/config-linux.md#seccomp

```bash
CONTAINER_ID=$(sudo crictl ps -q --name=test-container)
sudo crictl inspect "$CONTAINER_ID" | jq .info.runtimeSpec.linux.seccomp
```

```yaml
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"],
  "syscalls": [
    {
      "names": ["_llseek", "_newselect", "accept", …, "write", "writev"],
      "action": "SCMP_ACT_ALLOW"
    },
    …
  ]
}
```

You can see that the lower level container runtime ([CRI-O][crio-home] and
[runc][runc] in our case) successfully applied the default seccomp profile.
This profile denies all syscalls by default, while allowing commonly used ones
like [`accept`][accept] or [`write`][write].

[crio-home]: https://github.com/cri-o/cri-o
[runc]: https://github.com/opencontainers/runc
[accept]: https://man7.org/linux/man-pages/man2/accept.2.html
[write]: https://man7.org/linux/man-pages/man2/write.2.html

Please note that the feature will not influence any Kubernetes API for now.
Therefore, it is not possible to retrieve the used seccomp profile via `kubectl`
`get` or `describe` if the [`SeccompProfile`][api] field is unset within the
`SecurityContext`.

[api]: https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#security-context-1

The feature also works when using multiple containers within a pod, for example
if you create a pod like this:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
  - name: test-container-nginx
    image: nginx:1.21
    securityContext:
      seccompProfile:
        type: Unconfined
  - name: test-container-redis
    image: redis:6.2
```

then you should see that the `test-container-nginx` runs without a seccomp profile:

```bash
sudo crictl inspect $(sudo crictl ps -q --name=test-container-nginx) |
    jq '.info.runtimeSpec.linux.seccomp == null'
true
```

Whereas the container `test-container-redis` runs with `RuntimeDefault`:

```bash
sudo crictl inspect $(sudo crictl ps -q --name=test-container-redis) |
    jq '.info.runtimeSpec.linux.seccomp != null'
true
```

The same applies to the pod itself, which also runs with the default profile:

```bash
sudo crictl inspectp $(sudo crictl pods -q --name test-pod) |
    jq '.info.runtimeSpec.linux.seccomp != null'
true
```

### Upgrade strategy

It is recommended to enable the feature in multiple steps, where each step
comes with its own risks and mitigations.

#### Feature gate enabling

Enabling the feature gate at the kubelet level will not turn on the feature, but
will make it possible to do so by using the `seccompDefault` kubelet
configuration or the `--seccomp-default` CLI flag. This can be done by an
administrator for the whole cluster or only a set of nodes.

#### Testing the application

If you're trying this within a dedicated test environment, you have to ensure
that the application code does not trigger syscalls blocked by the
`RuntimeDefault` profile before enabling the feature on a node. This can be done
by:

- _Recommended_: Analyze the code (manually or by running the application with
  [strace][strace]) for any executed syscalls which may be blocked by the
  default profiles. If that's the case, then you can override the default by
  explicitly setting the pod or container to run as `Unconfined`. Alternatively,
  you can create a custom seccomp profile based on the default by adding the
  additional syscalls to the `"action": "SCMP_ACT_ALLOW"` section (see the
  optional step below).

- _Recommended_: Manually set the profile on the target workload and use a
  rolling upgrade to deploy into production. Roll back the deployment if the
  application does not work as intended.

- _Optional_: Run the application against an end-to-end test suite to trigger
  all relevant code paths with `RuntimeDefault` enabled. If a test fails, use
  the same mitigation as mentioned above.

- _Optional_: Create a custom seccomp profile based on the default and change
  its default action from `SCMP_ACT_ERRNO` to `SCMP_ACT_LOG`. This means that
  the seccomp filter for unknown syscalls will have no effect on the application
  at all, but the system logs will now indicate which syscalls may be blocked.
  This requires at least kernel version 4.14 as well as a recent [runc][runc]
  release. Monitor the application host's audit logs (defaults to
  `/var/log/audit/audit.log`) or syslog entries (defaults to `/var/log/syslog`)
  for syscalls via `type=SECCOMP` (for audit) or `type=1326` (for syslog).
  Compare the syscall ID with those [listed in the Linux kernel
  sources][syscalls] and add them to the custom profile. Be aware that custom
  audit policies may lead to missed syscalls, depending on the configuration
  of auditd.

- _Optional_: Use cluster additions like the [Security Profiles Operator][spo]
  for profiling the application via its [log enrichment][logs] capabilities or
  recording a profile by using its [recording feature][rec]. This makes the
  above mentioned manual log investigation obsolete.

[syscalls]: https://github.com/torvalds/linux/blob/7bb7f2a/arch/x86/entry/syscalls/syscall_64.tbl
[spo]: https://github.com/kubernetes-sigs/security-profiles-operator
[logs]: https://github.com/kubernetes-sigs/security-profiles-operator/blob/c90ef3a/installation-usage.md#using-the-log-enricher
[rec]: https://github.com/kubernetes-sigs/security-profiles-operator/blob/c90ef3a/installation-usage.md#record-profiles-from-workloads-with-profilerecordings
[strace]: https://man7.org/linux/man-pages/man1/strace.1.html

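The manual audit log investigation from the `SCMP_ACT_LOG` step could be
sketched as follows. The log lines are made-up sample data that only follow the
general shape of `type=SECCOMP` audit records, and the lookup table is a
hand-copied excerpt of the x86_64 syscall table covering just three entries.
The function names are hypothetical:

```bash
# Sketch: list the syscall numbers logged by SCMP_ACT_LOG in an audit log and
# translate them with a small excerpt of the x86_64 syscall table.
# The log lines below are made-up sample data, not real output.
set -euo pipefail

audit_log="$(mktemp)"
cat > "$audit_log" <<'EOF'
type=SECCOMP msg=audit(1629885600.123:101): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=4242 comm="app" exe="/usr/bin/app" sig=0 arch=c000003e syscall=41 compat=0 ip=0x7f0000000001 code=0x7ffc0000
type=SYSCALL msg=audit(1629885600.200:102): arch=c000003e syscall=59 success=yes
type=SECCOMP msg=audit(1629885601.456:103): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=4242 comm="app" exe="/usr/bin/app" sig=0 arch=c000003e syscall=257 compat=0 ip=0x7f0000000002 code=0x7ffc0000
type=SECCOMP msg=audit(1629885602.789:104): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=4242 comm="app" exe="/usr/bin/app" sig=0 arch=c000003e syscall=41 compat=0 ip=0x7f0000000001 code=0x7ffc0000
EOF

# logged_syscalls FILE
# Prints the unique syscall numbers from type=SECCOMP records, sorted.
logged_syscalls() {
  grep '^type=SECCOMP ' "$1" |
    sed -n 's/.* syscall=\([0-9][0-9]*\) .*/\1/p' |
    sort -n -u
}

# syscall_name NUMBER
# Looks up a name in a hand-copied excerpt of syscall_64.tbl (x86_64 only).
syscall_name() {
  case "$1" in
    41) echo "socket" ;;
    59) echo "execve" ;;
    257) echo "openat" ;;
    *) echo "unknown($1)" ;;
  esac
}

for nr in $(logged_syscalls "$audit_log"); do
  echo "$nr $(syscall_name "$nr")"
done
```

With the sample data this prints `41 socket` and `257 openat`, which would be
the candidates to add to the `SCMP_ACT_ALLOW` section of a custom profile. On
other architectures the syscall numbers differ, so the table for the matching
architecture has to be used.
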
#### Deploying the modified application

Based on the outcome of the application tests, it may be required to change the
application deployment by either specifying `Unconfined` or a custom seccomp
profile. This is not the case if the application works as intended with
`RuntimeDefault`.

#### Enable the kubelet configuration

If everything went well, then the feature is ready to be enabled by the kubelet
configuration or its corresponding CLI flag. This should be done on a per-node
basis to reduce the overall risk of having missed a syscall during the
investigations when running the application tests. If it's possible to monitor
audit logs within the cluster, then it's recommended to do so in order to catch
any missed seccomp events. If the application works as intended, then the
feature can be enabled for further nodes within the cluster.

## Conclusion

Thank you for reading this blog post! I hope you enjoyed seeing how the usage of
seccomp profiles has evolved in Kubernetes over the past releases as much
as I did. On your own cluster, change the default seccomp profile to
`RuntimeDefault` (using this new feature) and see the security benefits, and, of
course, feel free to reach out any time for feedback or questions.

---

_Editor's note: If you have any questions or feedback about this blog post, feel
free to reach out via the [Kubernetes Slack in #sig-node][slack]._

[slack]: https://kubernetes.slack.com/messages/sig-node