From 8ef95bf9a6358e9d5027b66116b8979fbcd0cb35 Mon Sep 17 00:00:00 2001
From: Sascha Grunert <sgrunert@redhat.com>
Date: Mon, 14 Nov 2022 13:18:32 +0100
Subject: [PATCH] Add seccomp notifier blog post

Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
---
 .../_posts/2022-12-02-seccomp-notifier.md     | 346 ++++++++++++++++++
 1 file changed, 346 insertions(+)
 create mode 100644 content/en/blog/_posts/2022-12-02-seccomp-notifier.md

diff --git a/content/en/blog/_posts/2022-12-02-seccomp-notifier.md b/content/en/blog/_posts/2022-12-02-seccomp-notifier.md
new file mode 100644
index 0000000000..55b4e502a9
--- /dev/null
+++ b/content/en/blog/_posts/2022-12-02-seccomp-notifier.md
@@ -0,0 +1,346 @@
+---
+layout: blog
+title: "Finding suspicious syscalls with the seccomp notifier"
+date: 2022-12-02
+slug: seccomp-notifier
+---
+
+**Authors:** Sascha Grunert
+
+Debugging software in production is one of the biggest challenges we have to
+face in our containerized environments. Being able to understand the impact of
+the available security options, especially when it comes to configuring our
+deployments, is one of the key aspects to make the default security in
+Kubernetes stronger. We have all those logging, tracing and metrics data already
+at hand, but how do we assemble the information they provide into something
+human readable and actionable?
+
+[Seccomp][seccomp] is one of the standard mechanisms to protect a Linux based
+Kubernetes application from malicious actions by interfering with its [system
+calls][syscalls]. This allows us to restrict the application to a defined set of
+actionable items, like modifying files or responding to HTTP requests. Mapping
+the set of syscalls required to, for example, modify a local file back to the
+actual source code is equally non-trivial. Seccomp
+profiles for Kubernetes have to be written in [JSON][json] and can be understood
+as an architecture specific allow-list with superpowers, for example:
+
+[seccomp]: https://en.wikipedia.org/wiki/Seccomp
+[syscalls]: https://en.wikipedia.org/wiki/Syscall
+[json]: https://www.json.org
+
+```json
+{
+  "defaultAction": "SCMP_ACT_ERRNO",
+  "defaultErrnoRet": 38,
+  "defaultErrno": "ENOSYS",
+  "syscalls": [
+    {
+      "names": ["chmod", "chown", "open", "write"],
+      "action": "SCMP_ACT_ALLOW"
+    }
+  ]
+}
+```
+
+The above profile errors by default because it specifies the `defaultAction`
+`SCMP_ACT_ERRNO`. This means we have to allow a set of syscalls via
+`SCMP_ACT_ALLOW`, otherwise the application would not be able to do anything at
+all. Okay cool, to allow file operations, all we have to do is add a bunch of
+file-specific syscalls like `open` or `write`, and probably also allow changing
+the permissions via `chmod` and `chown`, right?
+Basically yes, but there are issues with the simplicity of that approach:
+
+Seccomp profiles need to include the minimum set of syscalls required to start
+the application. This also includes some syscalls from the lower level
+[Open Container Initiative (OCI)][oci] container runtime, for example
+[runc][runc] or [crun][crun]. Besides that, we can only guarantee the required
+syscalls for a very specific version of the runtimes and our application,
+because the code parts can change between releases. The same applies to the
+termination of the application as well as the target architecture we're
+deploying on. Features like executing commands within containers also require
+another subset of syscalls. Not to mention that there are multiple versions of
+some syscalls doing slightly different things, and that seccomp profiles can
+also filter on their arguments. It's also not always clear to developers which
+syscalls are used by the code they wrote themselves, because they rely on
+programming language abstractions or frameworks.
+
+[oci]: https://opencontainers.org
+[runc]: https://github.com/opencontainers/runc
+[crun]: https://github.com/containers/crun
+
+_How can we know which syscalls are even required then? Who should create and
+maintain those profiles during the application's development life cycle?_
+
+Well, recording and distributing seccomp profiles is one of the problem domains
+of the [Security Profiles Operator][spo], which is already solving that. The
+operator is able to record [seccomp][seccomp], [SELinux][selinux] and even
+[AppArmor][apparmor] profiles into a [Custom Resource Definition (CRD)][crd],
+reconcile them to each node and make them available for usage.
+
+[spo]: https://github.com/kubernetes-sigs/security-profiles-operator
+[selinux]: https://en.wikipedia.org/wiki/Security-Enhanced_Linux
+[apparmor]: https://en.wikipedia.org/wiki/AppArmor
+[crd]: https://k8s.io/docs/concepts/extend-kubernetes/api-extension/custom-resources
+
+The biggest challenge in creating security profiles is catching all code
+paths which execute syscalls. We could achieve that by having **100%** logical
+coverage of the application when running an end-to-end test suite. You can see
+the problem with that statement: it's too idealistic to ever be fulfilled,
+even without taking all the moving parts during application development and
+deployment into account.
+
+Missing a syscall in the seccomp profile's allow list can have a tremendously
+negative impact on the application. It's not only that we can encounter crashes,
+which are trivially detectable. A missing syscall can also subtly change
+logical paths, alter the business logic, make parts of the application
+unusable, slow down performance or even expose security vulnerabilities. We're
+simply not able to see the whole impact of that, especially because syscalls
+blocked via `SCMP_ACT_ERRNO` do not provide any additional [audit][audit]
+logging on the system.
+
+[audit]: https://linux.die.net/man/8/auditd
+
+Does that mean we're lost? Is it just not realistic to dream about a Kubernetes
+where [everyone uses the default seccomp profile][seccomp-default]? Should we
+stop striving towards maximum security in Kubernetes and accept that it's not
+meant to be secure by default?
+
+[seccomp-default]: https://github.com/kubernetes/enhancements/issues/2413
+
+**Definitely not.** Technology evolves over time and there are many folks
+working behind the scenes of Kubernetes to indirectly deliver features to
+address such problems. One of those features is the _seccomp notifier_,
+which can be used to find suspicious syscalls in Kubernetes.
+
+The seccomp notify feature consists of a set of changes introduced in Linux 5.9.
+It makes the kernel capable of communicating seccomp related events to the user
+space. That allows applications to act based on the syscalls and opens up a
+wide range of possible use cases. We not only need the right kernel version,
+but also at least runc v1.1.0 (or crun v0.19) to be able to make the notifier
+work at all. The Kubernetes container runtime [CRI-O][cri-o] gets [support for
+the seccomp notifier in v1.26.0][cri-o-notifier]. The new feature allows us to
+identify possibly malicious syscalls in our application, and therefore makes it
+possible to verify profiles for consistency and completeness. Let's give that a
+try.
+
+[cri-o]: https://cri-o.io
+[cri-o-notifier]: https://github.com/cri-o/cri-o/pull/6120
+
+First of all we need to run the latest `main` version of CRI-O, because v1.26.0
+has not been released yet at the time of writing. You can do that by either
+compiling it from the [source code][sources] or by using the pre-built binary
+bundle via [the get-script][script]. The seccomp notifier feature of CRI-O is
+guarded by an annotation, which has to be explicitly allowed, for example by
+using a configuration drop-in like this:
+
+```console
+> cat /etc/crio/crio.conf.d/02-runtimes.conf
+```
+
+```toml
+[crio.runtime]
+default_runtime = "runc"
+
+[crio.runtime.runtimes.runc]
+allowed_annotations = [ "io.kubernetes.cri-o.seccompNotifierAction" ]
+```
+
+[sources]: https://github.com/cri-o/cri-o/blob/main/install.md#build-and-install-cri-o-from-source
+[script]: https://github.com/cri-o/cri-o#installing-cri-o
+
+If CRI-O is up and running, then it should indicate that the seccomp notifier is
+available as well:
+
+```console
+> sudo ./bin/crio --enable-metrics
+…
+INFO[…] Starting seccomp notifier watcher
+INFO[…] Serving metrics on :9090 via HTTP
+…
+```
+
+We also enable the metrics, because they provide additional telemetry data about
+the notifier. Now we need a running Kubernetes cluster for demonstration
+purposes. For this demo, we mainly stick to the
+[`hack/local-up-cluster.sh`][local-up] approach to locally spawn a single node
+Kubernetes cluster.
+
+[local-up]: https://github.com/cri-o/cri-o#running-kubernetes-with-cri-o
+
+If everything is up and running, then we need a seccomp profile for testing
+purposes. But we do not have to create our own; we can just use the
+`RuntimeDefault` profile which gets shipped with each container runtime. For
+example, the `RuntimeDefault` profile for CRI-O can be found in the
+[containers/common][runtime-default] library.
+
+[runtime-default]: https://github.com/containers/common/blob/afff1d6/pkg/seccomp/seccomp.json
+
+Now we need a test container, which can be a simple [nginx][nginx] pod like
+this:
+
+[nginx]: https://www.nginx.com
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: nginx
+  annotations:
+    io.kubernetes.cri-o.seccompNotifierAction: "stop"
+spec:
+  restartPolicy: Never
+  containers:
+    - name: nginx
+      image: nginx:1.23.2
+      securityContext:
+        seccompProfile:
+          type: RuntimeDefault
+```
+
+Please note the annotation `io.kubernetes.cri-o.seccompNotifierAction`, which
+enables the seccomp notifier for this workload. The value of the annotation can
+be either `stop` for stopping the workload or any other value for only logging
+and emitting metrics without stopping it. Because of the termination we also
+set `restartPolicy: Never` to not automatically recreate the container on
+failure.
+
+Let's run the pod and check if it works:
+
+```console
+> kubectl apply -f nginx.yaml
+```
+
+```console
+> kubectl get pods -o wide
+NAME    READY   STATUS    RESTARTS   AGE     IP          NODE        NOMINATED NODE   READINESS GATES
+nginx   1/1     Running   0          3m39s   10.85.0.3   127.0.0.1   <none>           <none>
+```
+
+We can also test if the web server itself works as intended:
+
+```console
+> curl 10.85.0.3
+<!DOCTYPE html>
+<html>
+<head>
+<title>Welcome to nginx!</title>
+…
+```
+
+While everything is now up and running, CRI-O also indicates that it has started
+the seccomp notifier:
+
+```
+…
+INFO[…] Injecting seccomp notifier into seccomp profile of container 662a3bb0fdc7dd1bf5a88a8aa8ef9eba6296b593146d988b4a9b85822422febb
+…
+```
+
+If we now run a forbidden syscall inside the container, then we can
+expect that the workload gets terminated. Let's give that a try by running
+`chroot` in the container's namespaces:
+
+```console
+> kubectl exec -it nginx -- bash
+```
+
+```console
+root@nginx:/# chroot /tmp
+chroot: cannot change root directory to '/tmp': Function not implemented
+root@nginx:/# command terminated with exit code 137
+```
+
+The exec session got terminated, so it looks like the container is not running
+any more:
+
+```console
+> kubectl get pods
+NAME    READY   STATUS           RESTARTS   AGE
+nginx   0/1     seccomp killed   0          96s
+```
+
+Alright, the container got killed by seccomp, do we get any more information
+about what was going on?
+
+```console
+> kubectl describe pod nginx
+Name:             nginx
+…
+Containers:
+  nginx:
+    …
+    State:          Terminated
+      Reason:       seccomp killed
+      Message:      Used forbidden syscalls: chroot (1x)
+      Exit Code:    137
+      Started:      Mon, 14 Nov 2022 12:19:46 +0100
+      Finished:     Mon, 14 Nov 2022 12:20:26 +0100
+…
+```
+
+The seccomp notifier feature of CRI-O correctly set the termination reason and
+message, including which forbidden syscall has been used how often (`1x`). How
+often? Yes, the notifier gives the application up to 5 seconds after the last
+seen syscall until it starts the termination. This means that it's possible to
+catch multiple forbidden syscalls within one test, avoiding time-consuming
+trial and error.
+
+```console
+> kubectl exec -it nginx -- chroot /tmp
+chroot: cannot change root directory to '/tmp': Function not implemented
+command terminated with exit code 125
+> kubectl exec -it nginx -- chroot /tmp
+chroot: cannot change root directory to '/tmp': Function not implemented
+command terminated with exit code 125
+> kubectl exec -it nginx -- swapoff -a
+command terminated with exit code 32
+> kubectl exec -it nginx -- swapoff -a
+command terminated with exit code 32
+```
+
+```console
+> kubectl describe pod nginx | grep Message
+      Message:      Used forbidden syscalls: chroot (2x), swapoff (2x)
+```
+
+The CRI-O metrics will also reflect that:
+
+```console
+> curl -sf localhost:9090/metrics | grep seccomp_notifier
+# HELP container_runtime_crio_containers_seccomp_notifier_count_total Amount of containers stopped because they used a forbidden syscalls by their name
+# TYPE container_runtime_crio_containers_seccomp_notifier_count_total counter
+container_runtime_crio_containers_seccomp_notifier_count_total{name="…",syscalls="chroot (1x)"} 1
+container_runtime_crio_containers_seccomp_notifier_count_total{name="…",syscalls="chroot (2x), swapoff (2x)"} 1
+```
+
+How does it work in detail? CRI-O uses the chosen seccomp profile and injects
+the action `SCMP_ACT_NOTIFY` instead of `SCMP_ACT_ERRNO`, `SCMP_ACT_KILL`,
+`SCMP_ACT_KILL_PROCESS` or `SCMP_ACT_KILL_THREAD`. It also sets a local listener
+path which will be used by the lower level OCI runtime (runc or crun) to create
+the seccomp notifier socket. If the connection between the socket and CRI-O has
+been established, then CRI-O will receive notifications for each syscall
+intercepted by seccomp. CRI-O stores the syscalls, allows a bit of timeout for
+them to arrive and then terminates the container if the chosen action is
+`seccompNotifierAction=stop`. Unfortunately, the seccomp notifier is not able to
+notify on the `defaultAction`, which means that it's required to have
+a list of syscalls to test for custom profiles. CRI-O also states that
+limitation in the logs:
+
+```log
+INFO[…] The seccomp profile default action SCMP_ACT_ERRNO cannot be overridden to SCMP_ACT_NOTIFY,
+        which means that syscalls using that default action can't be traced by the notifier
+```
+
+In conclusion, the seccomp notifier implementation in CRI-O can be used to
+verify if your applications behave correctly when using `RuntimeDefault` or any
+other custom profile. Alerts can be created based on the metrics to create long
+running test scenarios around that feature. Making seccomp understandable and
+easier to use will increase adoption as well as help us to move towards a more
+secure Kubernetes by default!
+
+Thank you for reading this blog post. If you'd like to read more about the
+seccomp notifier, check out the following resources:
+
+- The Seccomp Notifier - New Frontiers in Unprivileged Container Development: https://brauner.io/2020/07/23/seccomp-notify.html
+- Bringing Seccomp Notify to Runc and Kubernetes: https://kinvolk.io/blog/2022/03/bringing-seccomp-notify-to-runc-and-kubernetes
+- Seccomp Agent reference implementation: https://github.com/opencontainers/runc/tree/6b16d00/contrib/cmd/seccompagent