blog: introducing suspended jobs

Signed-off-by: Adhityaa Chandrasekar <adtac@google.com>
2021-03-23 14:12:01 +00:00 · 2021-03-23 14:12:01 +00:00 · cfc66c0f85
parent b41a02de59
commit cfc66c0f85
1 changed files with 110 additions and 0 deletions
--- a/content/en/blog/_posts/2021-04-12-introducing-suspended-jobs.md
+++ b/content/en/blog/_posts/2021-04-12-introducing-suspended-jobs.md
@ -0,0 +1,110 @@
+---
+title: "Introducing Suspended Jobs"
+date: 2021-04-12
+slug: introducing-suspended-jobs
+layout: blog
+---
+
+**Author:** Adhityaa Chandrasekar (Google)
+
+[Jobs](/docs/concepts/workloads/controllers/job/) are a crucial part of
+Kubernetes API. While other kinds of workloads such as [Deployments](/docs/concepts/workloads/controllers/deployment/),
+[ReplicaSets](/docs/concepts/workloads/controllers/replicaset/),
+[StatefulSets](/docs/concepts/workloads/controllers/statefulset/), and
+[DaemonSets](/docs/concepts/workloads/controllers/daemonset/)
+solve use-cases that require Pods to run forever, Jobs are useful when Pods need
+to run to completion. Commonly used in parallel batch processing, Jobs can be
+used in a variety of applications ranging from video rendering and database
+maintenance to sending bulk emails and scientific computing.
+
+While the amount of parallelism and the conditions for Job completion are
+configurable, the Kubernetes API lacked the ability to suspend and resume Jobs.
+This is often desired when cluster resources are limited and a higher priority
+Job needs to execute in the place of another Job. Deleting the lower priority
+Job is a poor workaround as Pod completion history and other metrics associated
+with the Job will be lost.
+
+With the recent Kubernetes 1.21 release, you will be able to suspend a Job by
+updating its spec. The feature is currently in **alpha** and requires you to
+enable the `SuspendJob` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
+on the [API server](/docs/reference/command-line-tools-reference/kube-apiserver/)
+and the [controller manager](/docs/reference/command-line-tools-reference/kube-controller-manager/)
+in order to use it.
+
+## API changes
+
+A new boolean field `suspend` is introduced in the Job spec API. Let's say I
+create the following Job:
+
+```yaml
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: my-job
+spec:
+  suspend: true
+  parallelism: 2
+  completions: 10
+  template:
+    spec:
+      containers:
+      - name: my-container
+        image: busybox
+        command: ["sleep", "5"]
+      restartPolicy: Never
+```
+
+Jobs are not suspended by default, so I'm explicitly setting the `suspend` field
+to true in the above Job spec. In the above example, the Job controller will
+refrain from creating Pods until I'm ready to start the Job, which I can do by
+updating the field to false.
+
+As another example, consider a Job that was created with the `suspend` field
+omitted. The Job controller will happily create Pods to work towards Job
+completion. However, before the Job completes, if I explicitly set the field to
+true with a Job update, the Job controller will terminate all active Pods that
+are running and will wait indefinitely for the flag to be flipped back to false.
+Pod termination is done by sending a SIGTERM signal to all active Pods; the
+[graceful termination period](/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination)
+defined in the Pod spec will be honoured. Pods terminated this way will not be
+counted as failures by the Job controller.
+
+It is important to understand that succeeded and failed Pods from the past will
+continue to exist after you suspend a Job. That is, that they will count towards
+Job completion once you resume it. You can verify this by looking at Job's
+status before and after suspension.
+
+Read the [documentation](/docs/concepts/workloads/controllers/job#suspending-a-job)
+for a full overview of this new feature.
+
+## Where is this useful?
+
+Let's say I'm the operator of a large cluster. I have many users submitting Jobs
+to the cluster, but not all Jobs are created equal — some Jobs are more
+important than others. Cluster resources aren't infinite either, so all users
+must share resources. If all Jobs were created in the suspended state and placed
+in a pending queue, I can achieve priority-based Job scheduling by resuming Jobs
+in the right order.
+
+As another motivational use-case, consider a cloud provider where compute
+resources are cheaper at night than in the morning. If I have a long-running Job
+that takes multiple days to complete, being able to suspend the Job in the
+morning and then resume it in the evening every day can reduce costs.
+
+Since this field is a part of the Job spec, CronJobs automatically get this
+feature for free too.
+
+## References and next steps
+
+If you're interested in a deeper dive into the rationale behind this feature and
+the decisions we have taken, consider reading the [enhancement proposal](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/2232-suspend-jobs).
+There's more detail on suspending and resuming jobs in the documentation for [Job](/docs/concepts/workloads/controllers/job#suspending-a-job).
+
+As previously mentioned, this feature is currently in alpha and is available
+only if you explicitly opt-in through the `SuspendJob` feature gate. If this is
+a feature you're interested in, please consider testing suspended Jobs in your
+cluster and providing feedback. You can discuss this enhancement [on GitHub](https://github.com/kubernetes/enhancements/issues/2232).
+The SIG Apps community also [meets regularly](https://github.com/kubernetes/community/tree/master/sig-apps#meetings)
+and can be reached through [Slack or the mailing list](https://github.com/kubernetes/community/tree/master/sig-apps#contact).
+Barring any unexpected changes to the API, we intend to graduate the feature to
+beta in Kubernetes 1.22, so that the feature becomes available by default.