Add seccomp notifier blog post

Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
pull/37896/head
Sascha Grunert 2022-11-14 13:18:32 +01:00
parent d3a2bb63f5
commit 8ef95bf9a6

---
layout: blog
title: "Finding suspicious syscalls with the seccomp notifier"
date: 2022-12-02
slug: seccomp-notifier
---
**Authors:** Sascha Grunert
Debugging software in production is one of the biggest challenges we face in
our containerized environments. Understanding the impact of the available
security options, especially when configuring our deployments, is key to making
the default security in Kubernetes stronger. We already have logging, tracing
and metrics data at hand, but how do we assemble the information they provide
into something human-readable and actionable?
[Seccomp][seccomp] is one of the standard mechanisms to protect a Linux-based
Kubernetes application from malicious actions by interfering with its [system
calls][syscalls]. This allows us to restrict the application to a defined set of
actions, like modifying files or responding to HTTP requests. Mapping the set of
syscalls required to, for example, modify a local file back to the actual source
code is just as non-trivial. Seccomp profiles for Kubernetes have to be written
in [JSON][json] and can be understood as an architecture-specific allow-list
with superpowers, for example:
[seccomp]: https://en.wikipedia.org/wiki/Seccomp
[syscalls]: https://en.wikipedia.org/wiki/Syscall
[json]: https://www.json.org
```json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "defaultErrnoRet": 38,
  "defaultErrno": "ENOSYS",
  "syscalls": [
    {
      "names": ["chmod", "chown", "open", "write"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
```
The above profile returns an error by default, because it specifies a
`defaultAction` of `SCMP_ACT_ERRNO`. This means we have to allow a set of
syscalls via `SCMP_ACT_ALLOW`, otherwise the application would not be able to do
anything at all. Okay cool, to allow file operations, all we have to do is add a
bunch of file-specific syscalls like `open` or `write`, and probably also allow
changing permissions via `chmod` and `chown`, right?
Basically yes, but there are issues with the simplicity of that approach:
Seccomp profiles need to include the minimum set of syscalls required to start
the application. This also includes some syscalls from the lower-level
[Open Container Initiative (OCI)][oci] container runtime, for example
[runc][runc] or [crun][crun]. Besides that, we can only guarantee the required
syscalls for a very specific version of the runtimes and our application,
because the code can change between releases. The same applies to the
termination of the application as well as the target architecture we're
deploying on. Features like executing commands within containers require yet
another subset of syscalls. Not to mention that there are multiple versions of
syscalls doing slightly different things, and that seccomp profiles are able to
match on their arguments. It's also not always clearly visible to developers
which syscalls their own code uses, because they rely on programming language
abstractions or frameworks.
[oci]: https://opencontainers.org
[runc]: https://github.com/opencontainers/runc
[crun]: https://github.com/containers/crun
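
To make the allow-list semantics of the example profile more tangible, here is a minimal Python sketch. It is a deliberately simplified model and an assumption on my side, not how the kernel evaluates a filter: it ignores architectures, argument matching and everything else a real profile can express. The helper name `resolve_action` is hypothetical.

```python
import json

# The example profile from above, embedded so the sketch is self-contained.
PROFILE_JSON = """
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "defaultErrnoRet": 38,
  "defaultErrno": "ENOSYS",
  "syscalls": [
    {
      "names": ["chmod", "chown", "open", "write"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
"""


def resolve_action(profile: dict, syscall: str) -> str:
    """Return the action of the first rule naming `syscall`, else the default.

    Simplified model: real profiles can also match on architectures and
    syscall arguments, which this hypothetical helper ignores.
    """
    for rule in profile.get("syscalls", []):
        if syscall in rule.get("names", []):
            return rule["action"]
    return profile["defaultAction"]


profile = json.loads(PROFILE_JSON)
for name in ("open", "openat", "chroot"):
    print(f"{name}: {resolve_action(profile, name)}")
```

Note that `open` is allowed while `openat` falls through to the default action, which illustrates why closely related syscall variants are easy to miss when writing a profile by hand.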
_How can we know which syscalls are even required then? Who should create and
maintain those profiles throughout the application's development life cycle?_
Well, recording and distributing seccomp profiles is one of the problem domains
of the [Security Profiles Operator][spo], which is already solving that. The
operator is able to record [seccomp][seccomp], [SELinux][selinux] and even
[AppArmor][apparmor] profiles into a [Custom Resource Definition (CRD)][crd],
reconciles them to each node and makes them available for usage.
[spo]: https://github.com/kubernetes-sigs/security-profiles-operator
[selinux]: https://en.wikipedia.org/wiki/Security-Enhanced_Linux
[apparmor]: https://en.wikipedia.org/wiki/AppArmor
[crd]: https://k8s.io/docs/concepts/extend-kubernetes/api-extension/custom-resources
The biggest challenge in creating security profiles is catching all code paths
which execute syscalls. We could achieve that by having **100%** logical
coverage of the application when running an end-to-end test suite. You can see
the problem with that statement: it's too idealistic to ever be fulfilled, even
without taking all the moving parts during application development and
deployment into account.
Missing a syscall in the seccomp profile's allow list can have a tremendously
negative impact on the application. It's not only that we can encounter crashes,
which are trivially detectable. A blocked syscall can also subtly change logical
paths, alter the business logic, make parts of the application unusable, slow
down performance or even expose security vulnerabilities. We're simply not able
to see the whole impact of that, especially because syscalls blocked via
`SCMP_ACT_ERRNO` do not provide any additional [audit][audit] logging on the
system.
[audit]: https://linux.die.net/man/8/auditd
Does that mean we're lost? Is it just not realistic to dream about a Kubernetes
where [everyone uses the default seccomp profile][seccomp-default]? Should we
stop striving towards maximum security in Kubernetes and accept that it's not
meant to be secure by default?
[seccomp-default]: https://github.com/kubernetes/enhancements/issues/2413
**Definitely not.** Technology evolves over time and there are many folks
working behind the scenes of Kubernetes to indirectly deliver features to
address such problems. One of those features is the _seccomp notifier_, which
can be used to find suspicious syscalls in Kubernetes.
The seccomp notify feature consists of a set of changes introduced in Linux 5.9.
It makes the kernel capable of communicating seccomp-related events to user
space. That allows applications to act based on the syscalls and opens up a
wide range of possible use cases. We not only need the right kernel version,
but also at least runc v1.1.0 (or crun v0.19) to make the notifier work at all.
The Kubernetes container runtime [CRI-O][cri-o] gets [support for the seccomp
notifier in v1.26.0][cri-o-notifier]. The new feature allows us to identify
possibly malicious syscalls in our application, and therefore makes it possible
to verify profiles for consistency and completeness. Let's give that a try.
[cri-o]: https://cri-o.io
[cri-o-notifier]: https://github.com/cri-o/cri-o/pull/6120
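
Before trying it out, it can help to check a node against the minimum versions named above. The following is a small sketch, not an official tool: the helper names `parse_version` and `meets_minimum` are hypothetical, and the kernel check is only meaningful when run on the Linux node itself.

```python
import platform
import re

# Minimum versions as stated in the text above.
MIN_KERNEL = (5, 9)
MIN_RUNC = (1, 1, 0)
MIN_CRUN = (0, 19)


def parse_version(text: str) -> tuple:
    """Extract the leading dotted number, e.g. 'v1.1.0' -> (1, 1, 0)."""
    match = re.search(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(version: str, minimum: tuple) -> bool:
    """Compare only as many components as `minimum` specifies."""
    return parse_version(version)[: len(minimum)] >= minimum


if __name__ == "__main__":
    release = platform.release()  # for example '5.15.0-56-generic' on Linux
    print(f"kernel {release}: new enough = {meets_minimum(release, MIN_KERNEL)}")
```

The same helper can be fed the version strings reported by the OCI runtime, for example `meets_minimum("v1.1.0", MIN_RUNC)`.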
First of all we need to run the latest `main` version of CRI-O, because v1.26.0
has not been released yet at the time of writing. You can do that by either
compiling it from the [source code][sources] or by using the pre-built binary
bundle via [the get-script][script]. The seccomp notifier feature of CRI-O is
guarded by an annotation, which has to be explicitly allowed, for example by
using a configuration drop-in like this:
```console
> cat /etc/crio/crio.conf.d/02-runtimes.conf
```
```toml
[crio.runtime]
default_runtime = "runc"
[crio.runtime.runtimes.runc]
allowed_annotations = [ "io.kubernetes.cri-o.seccompNotifierAction" ]
```
[sources]: https://github.com/cri-o/cri-o/blob/main/install.md#build-and-install-cri-o-from-source
[script]: https://github.com/cri-o/cri-o#installing-cri-o
If CRI-O is up and running, then it should indicate that the seccomp notifier is
available as well:
```console
> sudo ./bin/crio --enable-metrics
INFO[…] Starting seccomp notifier watcher
INFO[…] Serving metrics on :9090 via HTTP
```
We also enable the metrics, because they provide additional telemetry data about
the notifier. Now we need a running Kubernetes cluster for demonstration
purposes. For this demo, we mainly stick to the
[`hack/local-up-cluster.sh`][local-up] approach to locally spawn a single-node
Kubernetes cluster.
[local-up]: https://github.com/cri-o/cri-o#running-kubernetes-with-cri-o
With everything up and running, we need a seccomp profile for testing purposes.
We do not have to create our own, though: we can just use the `RuntimeDefault`
profile which gets shipped with each container runtime. For example, the
`RuntimeDefault` profile for CRI-O can be found in the
[containers/common][runtime-default] library.
[runtime-default]: https://github.com/containers/common/blob/afff1d6/pkg/seccomp/seccomp.json
Now we need a test container, which can be a simple [nginx][nginx] pod like
this:
[nginx]: https://www.nginx.com
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  annotations:
    io.kubernetes.cri-o.seccompNotifierAction: "stop"
spec:
  restartPolicy: Never
  containers:
    - name: nginx
      image: nginx:1.23.2
      securityContext:
        seccompProfile:
          type: RuntimeDefault
```
Please note the annotation `io.kubernetes.cri-o.seccompNotifierAction`, which
enables the seccomp notifier for this workload. The value of the annotation can
be either `stop`, which stops the workload, or anything else, which only logs
and emits metrics. Because of the termination we also use
`restartPolicy: Never` so that the container is not automatically recreated on
failure.
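
The interplay between the runtime configuration and the pod annotation can be summarised in a few lines of Python. This is only a model of the behaviour described in this post, not CRI-O's actual code; the function name `notifier_behaviour` and the returned strings are hypothetical.

```python
ANNOTATION = "io.kubernetes.cri-o.seccompNotifierAction"


def notifier_behaviour(pod_annotations: dict, allowed_annotations: list) -> str:
    """Summarise what the text describes for one workload (illustrative only)."""
    if ANNOTATION not in allowed_annotations:
        return "disabled: annotation not allowed in the runtime configuration"
    if ANNOTATION not in pod_annotations:
        return "disabled: annotation not set on the pod"
    if pod_annotations[ANNOTATION] == "stop":
        return "log, emit metrics and stop the workload"
    # Any other value only logs and emits metrics.
    return "log and emit metrics only"


allowed = [ANNOTATION]  # mirrors `allowed_annotations` in the drop-in
print(notifier_behaviour({ANNOTATION: "stop"}, allowed))
print(notifier_behaviour({ANNOTATION: "log"}, allowed))  # "log" is an arbitrary example value
print(notifier_behaviour({ANNOTATION: "stop"}, []))
```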
Let's run the pod and check if it works:
```console
> kubectl apply -f nginx.yaml
```
```console
> kubectl get pods -o wide
NAME    READY   STATUS    RESTARTS   AGE     IP          NODE        NOMINATED NODE   READINESS GATES
nginx   1/1     Running   0          3m39s   10.85.0.3   127.0.0.1   <none>           <none>
```
We can also test if the web server itself works as intended:
```console
> curl 10.85.0.3
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
```
While everything is now up and running, CRI-O also indicates that it has started
the seccomp notifier:
```
INFO[…] Injecting seccomp notifier into seccomp profile of container 662a3bb0fdc7dd1bf5a88a8aa8ef9eba6296b593146d988b4a9b85822422febb
```
If we now run a forbidden syscall inside the container, we can expect the
workload to get terminated. Let's give that a try by running `chroot` in the
container's namespaces:
```console
> kubectl exec -it nginx -- bash
```
```console
root@nginx:/# chroot /tmp
chroot: cannot change root directory to '/tmp': Function not implemented
root@nginx:/# command terminated with exit code 137
```
The exec session got terminated, so it looks like the container is not running
any more:
```console
> kubectl get pods
NAME    READY   STATUS           RESTARTS   AGE
nginx   0/1     seccomp killed   0          96s
```
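
Two details of the session above can be decoded with the Python standard library. `Function not implemented` is the usual message for the error `ENOSYS`, and exit code 137 follows the common convention of 128 plus a signal number. The helper names below are hypothetical, and the sketch assumes a POSIX system.

```python
import errno
import os
import signal


def describe_errno(code: int) -> tuple:
    """Return (symbolic name, human-readable message) for an errno value."""
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


def signal_from_exit_code(exit_code: int):
    """Apply the 128 + N convention; return None if no signal matches."""
    if exit_code <= 128:
        return None
    try:
        return signal.Signals(exit_code - 128).name
    except ValueError:
        return None


print(describe_errno(errno.ENOSYS))
print(signal_from_exit_code(137))
```

On Linux x86-64, `errno.ENOSYS` has the value 38, which is the same number used as `defaultErrnoRet` in the example profile at the beginning of this post.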
Alright, the container got killed by seccomp, do we get any more information
about what was going on?
```console
> kubectl describe pod nginx
Name:             nginx
Containers:
  nginx:
    State:          Terminated
      Reason:       seccomp killed
      Message:      Used forbidden syscalls: chroot (1x)
      Exit Code:    137
      Started:      Mon, 14 Nov 2022 12:19:46 +0100
      Finished:     Mon, 14 Nov 2022 12:20:26 +0100
```
The seccomp notifier feature of CRI-O correctly set the termination reason and
message, including which forbidden syscall has been used how often (`1x`). How
often? Yes, the notifier gives the application up to 5 seconds after the last
seen syscall until it starts the termination. This means that it's possible to
catch multiple forbidden syscalls within one test, avoiding time-consuming
trial and error.
```console
> kubectl exec -it nginx -- chroot /tmp
chroot: cannot change root directory to '/tmp': Function not implemented
command terminated with exit code 125
> kubectl exec -it nginx -- chroot /tmp
chroot: cannot change root directory to '/tmp': Function not implemented
command terminated with exit code 125
> kubectl exec -it nginx -- swapoff -a
command terminated with exit code 32
> kubectl exec -it nginx -- swapoff -a
command terminated with exit code 32
```
```console
> kubectl describe pod nginx | grep Message
Message: Used forbidden syscalls: chroot (2x), swapoff (2x)
```
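
The collect-then-terminate behaviour can be sketched as a small collector with a quiet period. This is a toy model written for this post, not the Go implementation in CRI-O: the class name `SyscallCollector` is hypothetical, timestamps are passed in explicitly to keep the example deterministic, and sorting the syscalls alphabetically in the message is an assumption.

```python
from collections import Counter

QUIET_PERIOD_SECONDS = 5.0  # "up to 5 seconds after the last seen syscall"


class SyscallCollector:
    """Toy model: count forbidden syscalls until no new one arrives for a while."""

    def __init__(self, quiet_period: float = QUIET_PERIOD_SECONDS):
        self.quiet_period = quiet_period
        self.counts = Counter()
        self.last_seen = None

    def record(self, syscall: str, now: float) -> None:
        """Count the syscall and restart the quiet period."""
        self.counts[syscall] += 1
        self.last_seen = now

    def should_stop(self, now: float) -> bool:
        """True once the quiet period has elapsed since the last syscall."""
        return self.last_seen is not None and now - self.last_seen >= self.quiet_period

    def message(self) -> str:
        """Build a termination message in the format shown above."""
        parts = [f"{name} ({count}x)" for name, count in sorted(self.counts.items())]
        return "Used forbidden syscalls: " + ", ".join(parts)


collector = SyscallCollector()
events = [(0.0, "chroot"), (1.5, "chroot"), (3.0, "swapoff"), (4.0, "swapoff")]
for timestamp, syscall in events:
    collector.record(syscall, now=timestamp)

print(collector.should_stop(now=6.0))  # only 2 seconds since the last syscall
print(collector.should_stop(now=9.0))  # 5 seconds of silence have passed
print(collector.message())
```

Every recorded syscall restarts the quiet period, which is why several `kubectl exec` calls in quick succession end up in a single termination message.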
The CRI-O metrics will also reflect that:
```console
> curl -sf localhost:9090/metrics | grep seccomp_notifier
# HELP container_runtime_crio_containers_seccomp_notifier_count_total Amount of containers stopped because they used a forbidden syscalls by their name
# TYPE container_runtime_crio_containers_seccomp_notifier_count_total counter
container_runtime_crio_containers_seccomp_notifier_count_total{name="…",syscalls="chroot (1x)"} 1
container_runtime_crio_containers_seccomp_notifier_count_total{name="…",syscalls="chroot (2x), swapoff (2x)"} 1
```
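
If you want to post-process these metrics, for example in a test script, the `syscalls` label can be split back into per-syscall counts. The following sketch parses only the sample lines shown above with regular expressions; it is not a complete parser for the metrics exposition format, and the function name `parse_notifier_metrics` is hypothetical.

```python
import re

METRIC = "container_runtime_crio_containers_seccomp_notifier_count_total"

# Sample lines copied from the output above (the container name is elided there).
SAMPLE = (
    METRIC + '{name="…",syscalls="chroot (1x)"} 1\n'
    + METRIC + '{name="…",syscalls="chroot (2x), swapoff (2x)"} 1\n'
)

LINE = re.compile(r"^" + re.escape(METRIC) + r"\{(?P<labels>.*)\}\s+(?P<value>\S+)$")
LABEL = re.compile(r'(\w+)="([^"]*)"')
SYSCALL = re.compile(r"(\w+) \((\d+)x\)")


def parse_notifier_metrics(text: str) -> list:
    """Return one dict per sample with the syscall counts from its label."""
    results = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match is None:
            continue  # skips comments such as # HELP and # TYPE, and other metrics
        labels = dict(LABEL.findall(match.group("labels")))
        counts = {name: int(n) for name, n in SYSCALL.findall(labels.get("syscalls", ""))}
        results.append({"syscalls": counts, "value": float(match.group("value"))})
    return results


for sample in parse_notifier_metrics(SAMPLE):
    print(sample)
```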
How does it work in detail? CRI-O takes the chosen seccomp profile and injects
the action `SCMP_ACT_NOTIFY` in place of `SCMP_ACT_ERRNO`, `SCMP_ACT_KILL`,
`SCMP_ACT_KILL_PROCESS` or `SCMP_ACT_KILL_THREAD`. It also sets a local listener
path which will be used by the lower-level OCI runtime (runc or crun) to create
the seccomp notifier socket. Once the connection between the socket and CRI-O
has been established, CRI-O receives a notification for each syscall that
seccomp interferes with. CRI-O stores the syscalls, allows a bit of timeout for
them to arrive and then terminates the container if the chosen
`seccompNotifierAction=stop`. Unfortunately, the seccomp notifier is not able to
notify on the `defaultAction`, which means that custom profiles need an explicit
list of syscalls to test for. CRI-O also states that limitation in the logs:
```log
INFO[…] The seccomp profile default action SCMP_ACT_ERRNO cannot be overridden to SCMP_ACT_NOTIFY,
which means that syscalls using that default action can't be traced by the notifier
```
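
The profile rewrite described above can be approximated in a few lines of Python. This is a sketch of the idea under simplifying assumptions, not CRI-O's actual code: the function name `inject_notify` is hypothetical, the listener path is left out, and only the per-syscall rules are touched, so the `defaultAction` stays as it is.

```python
import copy

# Actions which, according to the text, get replaced by SCMP_ACT_NOTIFY.
REPLACED_ACTIONS = {
    "SCMP_ACT_ERRNO",
    "SCMP_ACT_KILL",
    "SCMP_ACT_KILL_PROCESS",
    "SCMP_ACT_KILL_THREAD",
}


def inject_notify(profile: dict) -> dict:
    """Return a copy whose blocking per-syscall rules use SCMP_ACT_NOTIFY."""
    patched = copy.deepcopy(profile)
    for rule in patched.get("syscalls", []):
        if rule.get("action") in REPLACED_ACTIONS:
            rule["action"] = "SCMP_ACT_NOTIFY"
    # The defaultAction is intentionally left untouched, mirroring the
    # limitation stated in the log message above.
    return patched


# Hypothetical custom profile with an explicit list of syscalls to test for.
custom_profile = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"names": ["open", "write"], "action": "SCMP_ACT_ALLOW"},
        {"names": ["chroot", "swapoff"], "action": "SCMP_ACT_ERRNO"},
    ],
}

patched_profile = inject_notify(custom_profile)
for rule in patched_profile["syscalls"]:
    print(rule["names"], "->", rule["action"])
```

In this model only `chroot` and `swapoff` would trigger a notification, while any syscall hitting the default action would still fail silently.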
In conclusion, the seccomp notifier implementation in CRI-O can be used to
verify whether your applications behave correctly when using `RuntimeDefault` or
any other custom profile. Alerts can be created based on the metrics to build
long-running test scenarios around that feature. Making seccomp understandable
and easier to use will increase adoption as well as help us to move towards a
more secure Kubernetes by default!
Thank you for reading this blog post. If you'd like to read more about the
seccomp notifier, check out the following resources:
- The Seccomp Notifier - New Frontiers in Unprivileged Container Development: https://brauner.io/2020/07/23/seccomp-notify.html
- Bringing Seccomp Notify to Runc and Kubernetes: https://kinvolk.io/blog/2022/03/bringing-seccomp-notify-to-runc-and-kubernetes
- Seccomp Agent reference implementation: https://github.com/opencontainers/runc/tree/6b16d00/contrib/cmd/seccompagent