Clean up Job parallel processing expansion task

- use Task template, not Concept
- Explain use of curl for downloading
- Use DBPedia URLs for fruit (these should stay valid)
- Reword prerequisites
pull/19999/head
Tim Bannister 2019-12-10 01:35:17 +00:00 committed by bryan
parent 05b55bfaa4
commit 3e8f991640
1 changed files with 164 additions and 73 deletions


@ -1,52 +1,70 @@
---
title: Parallel Processing using Expansions
content_template: templates/concept
content_template: templates/task
min-kubernetes-server-version: v1.8
weight: 20
---
{{% capture overview %}}
In this example, we will run multiple Kubernetes Jobs created from
a common template. You may want to be familiar with the basic,
non-parallel, use of [Jobs](/docs/concepts/workloads/controllers/jobs-run-to-completion/) first.
This task demonstrates running multiple {{< glossary_tooltip text="Jobs" term_id="job" >}}
based on a common template. You can use this approach to process batches of work in
parallel.
For this example there are only three items: _apple_, _banana_, and _cherry_.
The sample Jobs process each item simply by printing a string then pausing.
See [using Jobs in real workloads](#using-jobs-in-real-workloads) to learn about how
this pattern fits more realistic use cases.
{{% /capture %}}
{{% capture prerequisites %}}
You should be familiar with the basic,
non-parallel, use of [Job](/docs/concepts/jobs/run-to-completion-finite-workloads/).
{{< include "task-tutorial-prereqs.md" >}}
For basic templating you need the command-line utility `sed`.
To follow the advanced templating example, you need a working installation of
[Python](https://www.python.org/), and the Jinja2 template
library for Python.
Once you have Python set up, you can install Jinja2 by running:
```shell
pip install --user jinja2
```
{{% /capture %}}
{{% capture body %}}
{{% capture steps %}}
## Basic Template Expansion
## Create Jobs based on a template
First, download the following template of a job to a file called `job-tmpl.yaml`
First, download the following template of a Job to a file called `job-tmpl.yaml`.
Here's what you'll download:
{{< codenew file="application/job/job-tmpl.yaml" >}}
Unlike a *pod template*, our *job template* is not a Kubernetes API type. It is just
a yaml representation of a Job object that has some placeholders that need to be filled
in before it can be used. The `$ITEM` syntax is not meaningful to Kubernetes.
```shell
# Use curl to download job-tmpl.yaml
curl -L -s -O https://k8s.io/examples/application/job/job-tmpl.yaml
```
In this example, the only processing the container does is to `echo` a string and sleep for a bit.
In a real use case, the processing would be some substantial computation, such as rendering a frame
of a movie, or processing a range of rows in a database. The `$ITEM` parameter would specify for
example, the frame number or the row range.
The file you downloaded is not yet a valid Kubernetes
{{< glossary_tooltip text="manifest" term_id="manifest" >}}.
Instead, that template is a YAML representation of a Job object with some placeholders
that need to be filled in before it can be used. The `$ITEM` syntax is not meaningful to Kubernetes.
This Job and its Pod template have a label: `jobgroup=jobexample`. There is nothing special
to the system about this label. This label
makes it convenient to operate on all the jobs in this group at once.
We also put the same label on the pod template so that we can check on all Pods of these Jobs
with a single command.
After the job is created, the system will add more labels that distinguish one Job's pods
from another Job's pods.
Note that the label key `jobgroup` is not special to Kubernetes. You can pick your own label scheme.
Next, expand the template into multiple files, one for each item to be processed.
### Create manifests from the template
The following shell snippet uses `sed` to replace the string `$ITEM` with the loop
variable, writing into a temporary directory named `jobs`. Run this now:
```shell
# Download job-templ.yaml
curl -L -s -O https://k8s.io/examples/application/job/job-tmpl.yaml
# Expand files into a temporary directory
# Expand the template into multiple files, one for each item to be processed.
mkdir ./jobs
for i in apple banana cherry
do
@ -68,11 +86,12 @@ job-banana.yaml
job-cherry.yaml
```
Here, we used `sed` to replace the string `$ITEM` with the loop variable.
You could use any type of template language (jinja2, erb) or write a program
to generate the Job objects.
You could use any type of template language (for example, Jinja2 or ERB), or
write a program to generate the Job manifests.
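As one illustration of the "write a program" route, the following Python sketch performs the same substitution as the `sed` loop using only the standard library. The inline `TEMPLATE` is an abbreviated stand-in for `job-tmpl.yaml` (an assumption; in practice you would read the downloaded file), and `expand_template` and `write_manifests` are hypothetical helper names.

```python
from pathlib import Path

# Abbreviated stand-in for job-tmpl.yaml (assumption: the real file has more fields).
TEMPLATE = """\
apiVersion: batch/v1
kind: Job
metadata:
  name: process-item-$ITEM
  labels:
    jobgroup: jobexample
"""


def expand_template(template_text, items):
    """Return a dict mapping each manifest filename to its expanded text."""
    return {
        f"job-{item}.yaml": template_text.replace("$ITEM", item)
        for item in items
    }


def write_manifests(manifests, directory="jobs"):
    """Write each expanded manifest into the given directory."""
    out_dir = Path(directory)
    out_dir.mkdir(exist_ok=True)
    for filename, text in manifests.items():
        (out_dir / filename).write_text(text)


manifests = expand_template(TEMPLATE, ["apple", "banana", "cherry"])
for filename in manifests:
    print(filename)
```

Calling `write_manifests(manifests)` afterwards would produce one file per item in a `jobs` directory, similar to what the shell loop writes.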
Next, create all the jobs with one kubectl command:
### Create Jobs from the manifests
Next, create all the Jobs with one kubectl command:
```shell
kubectl create -f ./jobs
@ -96,22 +115,23 @@ The output is similar to this:
```
NAME COMPLETIONS DURATION AGE
process-item-apple 1/1 14s 20s
process-item-banana 1/1 12s 20s
process-item-apple 1/1 14s 22s
process-item-banana 1/1 12s 21s
process-item-cherry 1/1 12s 20s
```
Here we use the `-l` option to select all jobs that are part of this
group of jobs. (There might be other unrelated jobs in the system that we
do not care to see.)
Using the `-l` option to kubectl selects only the Jobs that are part
of this group of Jobs (there might be other unrelated Jobs in the system).
You can check on the Pods as well using the same
{{< glossary_tooltip text="label selector" term_id="selector" >}}:
We can check on the pods as well using the same label selector:
```shell
kubectl get pods -l jobgroup=jobexample
```
The output is similar to this:
The output is similar to:
```
NAME READY STATUS RESTARTS AGE
@ -126,7 +146,7 @@ We can use this single command to check on the output of all jobs at once:
kubectl logs -f -l jobgroup=jobexample
```
The output is:
The output should be:
```
Processing item apple
@ -134,26 +154,40 @@ Processing item banana
Processing item cherry
```
## Multiple Template Parameters
### Clean up {#cleanup-1}
In the first example, each instance of the template had one parameter, and that parameter was also
used as a label. However label keys are limited in [what characters they can
contain](/docs/concepts/overview/working-with-objects/labels/#syntax-and-character-set).
```shell
# Remove the Jobs you created
# Your cluster automatically cleans up their Pods
kubectl delete job -l jobgroup=jobexample
```
This slightly more complex example uses the jinja2 template language to generate our objects.
We will use a one-line python script to convert the template to a file.
## Use advanced template parameters
In the [first example](#create-jobs-based-on-a-template), each instance of the template had one
parameter, and that parameter was also used in the Job's name. However,
[names](/docs/concepts/overview/working-with-objects/names/#names) are restricted
to contain only certain characters.
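To see why a URL cannot be used directly in a name, here is a small Python sketch that checks a candidate against the RFC 1123 DNS label rules. Treat this as an assumption about the strictest common case: the exact rule that applies depends on the resource type, so check the names documentation. `is_dns_label` is a hypothetical helper, not part of any Kubernetes client.

```python
import re

# RFC 1123 DNS label: lowercase alphanumerics and '-', starting and ending
# with an alphanumeric character, at most 63 characters long.
DNS_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def is_dns_label(name):
    """Return True if name satisfies the RFC 1123 label rules."""
    return len(name) <= 63 and DNS_LABEL.fullmatch(name) is not None


candidates = [
    "process-item-apple",
    "process-item-http://dbpedia.org/resource/Apple",
]
for candidate in candidates:
    print(candidate, is_dns_label(candidate))
```

The second candidate fails because it contains `:`, `/`, `.` and an uppercase letter, which is why the URL is passed as a separate parameter rather than embedded in the name.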
This slightly more complex example uses the
[Jinja template language](https://palletsprojects.com/p/jinja/) to generate manifests
and then objects from those manifests, with multiple parameters for each Job.
For this part of the task, you are going to use a one-line Python script to
convert the template to a set of manifests.
First, copy and paste the following template of a Job object, into a file called `job.yaml.jinja2`:
```liquid
{%- set params = [{ "name": "apple", "url": "https://www.orangepippin.com/varieties/apples", },
{ "name": "banana", "url": "https://en.wikipedia.org/wiki/Banana", },
{ "name": "raspberry", "url": "https://www.raspberrypi.org/" }]
{%- set params = [{ "name": "apple", "url": "http://dbpedia.org/resource/Apple", },
{ "name": "banana", "url": "http://dbpedia.org/resource/Banana", },
{ "name": "cherry", "url": "http://dbpedia.org/resource/Cherry" }]
%}
{%- for p in params %}
{%- set name = p["name"] %}
{%- set url = p["url"] %}
---
apiVersion: batch/v1
kind: Job
metadata:
@ -172,51 +206,108 @@ spec:
image: busybox
command: ["sh", "-c", "echo Processing URL {{ url }} && sleep 5"]
restartPolicy: Never
---
{%- endfor %}
```
The above template defines parameters for each job object using a list of
python dicts (lines 1-4). Then a for loop emits one job yaml object
for each set of parameters (remaining lines).
We take advantage of the fact that multiple yaml documents can be concatenated
with the `---` separator (second to last line).
.) We can pipe the output directly to kubectl to
create the objects.
The above template defines two parameters for each Job object using a list of
Python dicts (lines 1-4). A `for` loop emits one Job manifest for each
set of parameters (remaining lines).
You will need the jinja2 package if you do not already have it: `pip install --user jinja2`.
Now, use this one-line python program to expand the template:
This example relies on a feature of YAML. One YAML file can contain multiple
documents (Kubernetes manifests, in this case), separated by `---` on a line
by itself.
You can pipe the output directly to `kubectl` to create the Jobs.
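To make the loop and the `---` separator concrete without Jinja, here is a rough standard-library sketch of the same idea. The `DOC` string is an abbreviated stand-in for a Job manifest; the `example-{name}` name pattern, the container name `c`, and the function name `render_all` are hypothetical and not taken from this task's files.

```python
# Parameters mirroring the template: one dict per Job.
PARAMS = [
    {"name": "apple", "url": "http://dbpedia.org/resource/Apple"},
    {"name": "banana", "url": "http://dbpedia.org/resource/Banana"},
    {"name": "cherry", "url": "http://dbpedia.org/resource/Cherry"},
]

# Abbreviated stand-in for one Job manifest (assumption: fields trimmed,
# name pattern and container name are hypothetical).
DOC = """\
apiVersion: batch/v1
kind: Job
metadata:
  name: example-{name}
spec:
  template:
    spec:
      containers:
      - name: c
        image: busybox
        command: ["sh", "-c", "echo Processing URL {url} && sleep 5"]
      restartPolicy: Never
"""


def render_all(doc, params):
    """Render one document per parameter set, separated by '---' lines."""
    return "---\n".join(doc.format(**p) for p in params)


output = render_all(DOC, PARAMS)
print(output)
```

Three parameter sets yield three documents with two `---` separator lines between them, which `kubectl` reads as three separate manifests.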
Next, use this one-line Python program to expand the template:
```shell
alias render_template='python -c "from jinja2 import Template; import sys; print(Template(sys.stdin.read()).render());"'
```
The output can be saved to a file, like this:
Use `render_template` to convert the parameters and template into a single
YAML file containing Kubernetes manifests:
```shell
# This requires the alias you defined earlier
cat job.yaml.jinja2 | render_template > jobs.yaml
```
Or sent directly to kubectl, like this:
You can view `jobs.yaml` to verify that the `render_template` script worked
correctly.
Once you are happy that `render_template` is working as you intend,
you can pipe its output into `kubectl`:
```shell
cat job.yaml.jinja2 | render_template | kubectl apply -f -
```
## Alternatives
Kubernetes accepts and runs the Jobs you created.
If you have a large number of job objects, you may find that:
### Clean up {#cleanup-2}
- Even using labels, managing so many Job objects is cumbersome.
- You exceed resource quota when creating all the Jobs at once,
and do not want to wait to create them incrementally.
- Very large numbers of jobs created at once overload the
Kubernetes apiserver, controller, or scheduler.
In this case, you can consider one of the
other [job patterns](/docs/concepts/jobs/run-to-completion-finite-workloads/#job-patterns).
```shell
# Remove the Jobs you created
# Your cluster automatically cleans up their Pods
kubectl delete job -l jobgroup=jobexample
```
{{% /capture %}}
{{% capture discussion %}}
## Using Jobs in real workloads
In a real use case, each Job performs some substantial computation, such as rendering a frame
of a movie, or processing a range of rows in a database. If you were rendering a movie
you would set `$ITEM` to the frame number. If you were processing rows from a database
table, you would set `$ITEM` to represent the range of database rows to process.
In the task, you ran a command to collect the output from Pods by fetching
their logs. In a real use case, each Pod for a Job writes its output to
durable storage before completing. You can use a PersistentVolume for each Job,
or an external storage service. For example, if you are rendering frames for a movie,
use HTTP to `PUT` the rendered frame data to a URL, using a different URL for each
frame.
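As a minimal sketch of the "different URL for each frame" idea, the following Python builds (but does not send) a `PUT` request per frame using the standard library. The storage endpoint `https://storage.example.com/frames` and the zero-padded `.png` naming are assumptions made for illustration only.

```python
from urllib.request import Request

# Hypothetical storage endpoint; replace with your own service.
BASE_URL = "https://storage.example.com/frames"


def build_put_request(frame_number, frame_data):
    """Build (but do not send) a PUT request for one rendered frame."""
    url = f"{BASE_URL}/{frame_number:06d}.png"
    return Request(url, data=frame_data, method="PUT")


request = build_put_request(42, b"rendered frame bytes")
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; a real workload would also need authentication and retry handling, which this sketch leaves out.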
## Labels on Jobs and Pods
After you create a Job, Kubernetes automatically adds additional
{{< glossary_tooltip text="labels" term_id="label" >}} that
distinguish one Job's pods from another Job's pods.
In this example, each Job and its Pod template have a label:
`jobgroup=jobexample`.
Kubernetes itself pays no attention to labels named `jobgroup`. Setting a label
for all the Jobs you create from a template makes it convenient to operate on all
those Jobs at once.
In the [first example](#create-jobs-based-on-a-template) you used a template to
create several Jobs. The template ensures that each Pod also gets the same label, so
you can check on all Pods for these templated Jobs with a single command.
{{< note >}}
The label key `jobgroup` is not special or reserved.
You can pick your own labelling scheme.
There are [recommended labels](/docs/concepts/overview/working-with-objects/common-labels/#labels)
that you can use if you wish.
{{< /note >}}
## Alternatives
If you plan to create a large number of Job objects, you may find that:
- Even using labels, managing so many Jobs is cumbersome.
- If you create many Jobs in a batch, you might place high load
on the Kubernetes control plane. Alternatively, the Kubernetes API
server could rate limit you, temporarily rejecting your requests with a 429 status.
- You are limited by a {{< glossary_tooltip text="resource quota" term_id="resource-quota" >}}
on Jobs: the API server permanently rejects some of your requests
when you create a great deal of work in one batch.
There are other [job patterns](/docs/concepts/jobs/run-to-completion-finite-workloads/#job-patterns)
that you can use to process large amounts of work without creating very many Job
objects.
You could also consider writing your own [controller](/docs/concepts/architecture/controller/)
to manage Job objects automatically.
{{% /capture %}}