---
title: "My exciting journey into Kubernetes’ history"
date: 2020-05-28
slug: kubernetes-history
url: /blog/2020/05/my-exciting-journey-into-kubernetes-history
---

**Author:** Sascha Grunert, SUSE Software Solutions

_Editor's note: Sascha is part of [SIG Release][0] and works on many other
container runtime related topics. Feel free to reach out to him on
Twitter [@saschagrunert][1]._

[0]: https://github.com/kubernetes/sig-release
[1]: https://twitter.com/saschagrunert

---

> A story of data science-ing 90,000 GitHub issues and pull requests by using
> Kubeflow, TensorFlow, Prow and a fully automated CI/CD pipeline.

- [Introduction](#introduction)
- [Getting the Data](#getting-the-data)
- [Exploring the Data](#exploring-the-data)
  - [Labels, Labels, Labels](#labels-labels-labels)
- [Building the Machine Learning Model](#building-the-machine-learning-model)
  - [Doing some first Natural Language Processing (NLP)](#doing-some-first-natural-language-processing-nlp)
  - [Creating the Multi-Layer Perceptron (MLP) Model](#creating-the-multi-layer-perceptron-mlp-model)
  - [Training the Model](#training-the-model)
  - [A first Prediction](#a-first-prediction)
- [Automate Everything](#automate-everything)
- [Automatic Labeling of new PRs](#automatic-labeling-of-new-prs)
- [Summary](#summary)

# Introduction

Choosing the right steps when working in the field of data science is truly no
silver bullet. Most data scientists might have their own custom workflow, which
could be more or less automated, depending on their area of work. Using
[Kubernetes][10] can be a tremendous enhancement when trying to automate
workflows on a large scale. In this blog post, I would like to take you on my
journey of doing data science while integrating the overall workflow into
Kubernetes.

The target of the research I did in the past few months was to find any useful
information about all those thousands of GitHub issues and pull requests (PRs)
we have in the [Kubernetes repository][11]. What I ended up with was a fully
automated Continuous Integration (CI) and Continuous Deployment (CD) data
science workflow running in Kubernetes, powered by [Kubeflow][12] and
[Prow][13]. You may not know both of them, but we will get to the point where I
explain what they're doing in detail. The source code of my work can be found
in the [kubernetes-analysis GitHub repository][14], which contains everything
source code-related as well as the raw data. But how do we retrieve this data
I'm talking about? Well, this is where the story begins.

[10]: https://kubernetes.io
[11]: https://github.com/kubernetes/kubernetes
[12]: https://www.kubeflow.org
[13]: https://github.com/kubernetes/test-infra/tree/master/prow
[14]: https://github.com/kubernetes-analysis/kubernetes-analysis

# Getting the Data

The foundation for my experiments is the raw GitHub API data in plain [JSON][23]
format. The necessary data can be retrieved via the [GitHub issues
endpoint][20], which returns all pull requests as well as regular issues via the
[REST][21] API. I exported roughly **91,000** issues and pull requests in the
first iteration into a massive **650 MiB** data blob. This took me about **8
hours** of data retrieval time because, of course, the GitHub API is [rate
limited][22]. To be able to put this data into a GitHub repository, I chose to
compress it via [`xz(1)`][24]. The result was a roughly [25 MiB sized
tarball][25], which fits well into the repository.

[20]: https://developer.github.com/v3/issues
[21]: https://en.wikipedia.org/wiki/Representational_state_transfer
[22]: https://developer.github.com/apps/building-github-apps/understanding-rate-limits-for-github-apps/
[23]: https://en.wikipedia.org/wiki/JSON
[24]: https://linux.die.net/man/1/xz
[25]: https://github.com/kubernetes-analysis/kubernetes-analysis/blob/master/data/api.tar.xz

I had to find a way to regularly update the dataset, because Kubernetes issues
and pull requests are updated by their users over time, and new ones are
created continuously. To achieve the continuous update without having to wait 8
hours over and over again, I now fetch only the delta of the GitHub API data
between the [last update][31] and the current time. This way, a Continuous
Integration job can update the data on a regular basis, whereas I can continue
my research with the latest available set of data.
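
To illustrate the idea behind this delta fetch, here is a minimal Python sketch
which uses the `since` parameter of the GitHub issues endpoint. It is not the
actual implementation from the repository, and the helper
`fetch_updated_issues` is made up for this example:

```python
import os

import requests

API_URL = "https://api.github.com/repos/kubernetes/kubernetes/issues"


def fetch_updated_issues(since: str) -> list:
    """Fetch all issues and PRs updated after `since` (an ISO 8601 timestamp)."""
    headers = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}
    # `state=all` includes closed items, `since` limits the result to the delta.
    params = {"state": "all", "since": since, "per_page": 100, "page": 1}
    items = []
    while True:
        response = requests.get(API_URL, headers=headers, params=params)
        response.raise_for_status()
        page = response.json()
        if not page:
            return items
        items.extend(page)
        params["page"] += 1


issues = fetch_updated_issues("2020-05-09T10:57:40Z")
print(f"Got {len(issues)} updated items")
```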

From a tooling perspective, I've written an [all-in-one Python executable][30],
which allows us to trigger the different steps during the data science
experiments separately via dedicated subcommands. For example, to run an export
of the whole data set, we can call:

[30]: https://github.com/kubernetes-analysis/kubernetes-analysis/blob/master/main
[31]: https://github.com/kubernetes-analysis/kubernetes-analysis/blob/master/.update

```
> export GITHUB_TOKEN=<MY-SECRET-TOKEN>
> ./main export
INFO | Getting GITHUB_TOKEN from environment variable
INFO | Dumping all issues
INFO | Pulling 90929 items
INFO | 1: Unit test coverage in Kubelet is lousy. (~30%)
INFO | 2: Better error messages if go isn't installed, or if gcloud is old.
INFO | 3: Need real cluster integration tests
INFO | 4: kubelet should know which containers it is managing
… [just wait 8 hours] …
```

To update the data between the last timestamp stored in the repository and the
current time, we can run:

```
> ./main export --update-api
INFO | Getting GITHUB_TOKEN from environment variable
INFO | Retrieving issues and PRs
INFO | Updating API
INFO | Got update timestamp: 2020-05-09T10:57:40.854151
INFO | 90786: Automated cherry pick of #90749: fix: azure disk dangling attach issue
INFO | 90674: Switch core master base images from debian to distroless
INFO | 90086: Handling error returned by request.Request.ParseForm()
INFO | 90544: configurable weight on the CPU and memory
INFO | 87746: Support compiling Kubelet w/o docker/docker
INFO | Using already extracted data from data/data.pickle
INFO | Loading pickle dataset
INFO | Parsed 34380 issues and 55832 pull requests (90212 items)
INFO | Updating data
INFO | Updating issue 90786 (updated at 2020-05-09T10:59:43Z)
INFO | Updating issue 90674 (updated at 2020-05-09T10:58:27Z)
INFO | Updating issue 90086 (updated at 2020-05-09T10:58:26Z)
INFO | Updating issue 90544 (updated at 2020-05-09T10:57:51Z)
INFO | Updating issue 87746 (updated at 2020-05-09T11:01:51Z)
INFO | Saving data
```

This gives us an idea of how fast the project is actually moving: on a Saturday
at noon (European time), 5 issues and pull requests got updated within literally
5 minutes!

Funnily enough, [Joe Beda][32], one of the founders of Kubernetes, created the
first GitHub issue, [mentioning that the unit test coverage is too low][33]. The
issue has no description beyond its title and no enhanced labeling applied, as
we know it from more recent issues and pull requests. Now we have to explore the
exported data more deeply to do something useful with it.

[32]: https://github.com/jbeda
[33]: https://github.com/kubernetes/kubernetes/issues/1

# Exploring the Data

Before we can start creating machine learning models and train them, we have to
get an idea of how our data is structured and what we want to achieve in
general.

To get a better feeling for the amount of data, let's look at how many issues
and pull requests have been created over time inside the Kubernetes repository:

```
> ./main analyze --created
INFO | Using already extracted data from data/data.pickle
INFO | Loading pickle dataset
INFO | Parsed 34380 issues and 55832 pull requests (90212 items)
```

The Python [matplotlib][40] module should pop up a graph which looks like this:

*[Chart: issues and pull requests created over time]*

[40]: https://matplotlib.org

Okay, this chart does not look that spectacular, but it gives us an impression
of how the project has grown over the past 6 years. To get a better idea about
the speed of development of the project, we can look at the _created-vs-closed_
metric. This means that on our timeline, we add one to the y-axis if an issue or
pull request got created and subtract one if it got closed. Now the chart looks
like this:

```
> ./main analyze --created-vs-closed
```

*[Chart: the created-vs-closed metric for issues and pull requests]*
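
To make the metric concrete, here is a small pandas sketch of how such a
cumulative count can be computed; the column names and sample dates are made up:

```python
import pandas as pd

# One row per issue/PR with its creation and (optional) closing timestamp.
df = pd.DataFrame({
    "created": pd.to_datetime(["2019-01-02", "2019-01-05", "2019-01-07"]),
    "closed": pd.to_datetime(["2019-01-06", None, "2019-01-09"]),
})

# +1 for every creation, -1 for every close, cumulatively summed over time.
events = pd.concat([
    pd.Series(1, index=df["created"]),
    pd.Series(-1, index=df["closed"].dropna()),
]).sort_index()

created_vs_closed = events.cumsum()
print(created_vs_closed)
```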

At the beginning of 2018, the Kubernetes project introduced some more enhanced
life-cycle management via the glorious [fejta-bot][41]. This bot automatically
closes issues and pull requests after they have been stale for a longer period
of time. This resulted in a massive closing of issues, which does not apply to
pull requests to the same extent. We can see this, for example, if we look at
the _created-vs-closed_ metric only for pull requests:

[41]: https://github.com/fejta-bot

```
> ./main analyze --created-vs-closed --pull-requests
```

*[Chart: the created-vs-closed metric for pull requests only]*

The overall impact is not that obvious here. What we can see is that the
increasing number of peaks in the PR chart indicates that the project is moving
faster over time. Usually, a candlestick chart would be a better choice for
showing this kind of volatility-related information. I'd also like to highlight
that it looks like the development of the project slowed down a bit at the
beginning of 2020.

Parsing raw JSON on every analysis iteration is not the fastest approach in
Python. This is why I decided to parse the more important information, for
example the content, title and creation time, into dedicated [issue][50] and
[PR classes][51]. This data gets [pickle][58] serialized into the repository as
well, which allows an overall faster startup independently of the JSON blob.

A pull request is more or less the same as an issue in my analysis, except that
it also contains a release note.

[50]: https://github.com/kubernetes-analysis/kubernetes-analysis/blob/master/src/issue.py
[51]: https://github.com/kubernetes-analysis/kubernetes-analysis/blob/master/src/pull_request.py
[58]: https://docs.python.org/3/library/pickle.html
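
As an illustration, the serialization pattern looks roughly like this; the real
classes in the repository are more elaborate:

```python
import pickle
from dataclasses import dataclass


# Hypothetical, simplified versions of the parsed issue and PR classes.
@dataclass
class Issue:
    title: str
    created: str
    content: str


@dataclass
class PullRequest(Issue):
    release_note: str = ""


# Serialize the parsed items once, then load them quickly on every analysis run.
items = [Issue("Need real cluster integration tests", "2014-06-06", "…")]
with open("data/data.pickle", "wb") as f:
    pickle.dump(items, f)

with open("data/data.pickle", "rb") as f:
    items = pickle.load(f)
```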

Release notes in Kubernetes are written in the PR's description, into a separate
`release-note` block like this:

````
```release-note
I changed something extremely important and you should note that.
```
````
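
As a sketch of how such a block can be extracted (this is not the actual
implementation of the Release Engineering tools mentioned below), a regular
expression is already enough:

````python
import re

# Matches the content of a fenced ```release-note block in a PR description.
RELEASE_NOTE_RE = re.compile(r"```release-note\s*\n(.+?)\n```", re.DOTALL)


def extract_release_note(pr_description: str) -> str:
    """Return the release note text, or an empty string if no block exists."""
    match = RELEASE_NOTE_RE.search(pr_description)
    return match.group(1).strip() if match else ""
````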

Those release notes are parsed by [dedicated Release Engineering tools like
`krel`][52] during the release creation process and become part of the various
[CHANGELOG.md][53] files and the [Release Notes Website][54]. That seems like a
lot of magic, but in the end, the quality of the overall release notes is much
higher because they're easy to edit, and the PR reviewers can ensure that we
only document real user-facing changes and nothing else.

[52]: https://github.com/kubernetes/release#tools
[53]: https://github.com/kubernetes/kubernetes/tree/master/CHANGELOG
[54]: https://relnotes.k8s.io

The quality of the input data is a key aspect when doing data science. I decided
to focus on the release notes because they seem to have the highest overall
quality when comparing them to the plain descriptions in issues and PRs. Besides
that, they're easy to parse, and we would not need to strip away the [various
issue][55] and [PR template][56] text noise.

[55]: https://github.com/kubernetes/kubernetes/tree/master/.github/ISSUE_TEMPLATE
[56]: https://github.com/kubernetes/kubernetes/blob/master/.github/PULL_REQUEST_TEMPLATE.md

## Labels, Labels, Labels

Issues and pull requests in Kubernetes get different labels applied during their
life-cycle. The labels are usually grouped via a single slash (`/`). For
example, we have `kind/bug` and `kind/api-change` or `sig/node` and
`sig/network`. An easy way to understand which label groups exist and how
they're distributed across the repository is to plot them into a bar chart:

```
> ./main analyze --labels-by-group
```

*[Bar chart: label distribution by group]*

It looks like `sig/`, `kind/` and `area/` labels are pretty common. Labels like
`size/` can be ignored for now because they are applied automatically, based on
the amount of code changes in a pull request. We said that we want to focus on
release notes as input data, which means that we have to check out the
distribution of the labels for the PRs. These are the top 25 labels on pull
requests:

```
> ./main analyze --labels-by-name --pull-requests
```

*[Bar chart: the top 25 labels on pull requests]*

Again, we can ignore labels like `lgtm` (looks good to me), because every PR
which should get merged has to look good. Pull requests containing release notes
automatically get the `release-note` label applied, which enables easier
filtering. This does not mean that every PR carrying that label also contains
the release notes block: the label could have been applied manually, and the
parsing of the release notes block has not existed since the beginning of the
project. This means we will probably lose a decent amount of input data on the
one hand. On the other hand, we can focus on the highest possible data quality,
because applying labels the right way requires some enhanced maturity of the
project and its contributors.

From a label group perspective, I have chosen to focus on the `kind/` labels.
Those labels have to be applied manually by the author of the PR, they are
available on a good amount of pull requests, and they're related to user-facing
changes as well. Besides that, the `kind/` choice has to be made for every pull
request because it is part of the PR template.

Alright, what does the distribution of those labels look like when focusing only
on pull requests which have release notes?

```
> ./main analyze --release-notes-stats
```

*[Chart: kind/ label distribution for pull requests containing release notes]*

Interestingly, we have approximately 7,000 overall pull requests containing
release notes, but only ~5,000 have a `kind/` label applied. The distribution of
the labels is not equal, and one-third of them are labeled as `kind/bug`. This
brings me to the next decision in my data science journey: I will build a binary
classifier which, for the sake of simplicity, is only able to distinguish
between bugs (via `kind/bug`) and non-bugs (where the label is not applied).

The main target is now to be able to classify newly incoming release notes and
decide whether they are related to a bug or not, based on the historical data we
already have from the community.

Before doing that, I recommend that you play around with the `./main analyze -h`
subcommand as well to explore the latest set of data. You can also check out all
the [continuously updated assets][57] I provide within the analysis repository.
For example, these are the top 25 PR creators inside the Kubernetes repository:

*[Bar chart: the top 25 PR creators in the Kubernetes repository]*

[57]: https://github.com/kubernetes-analysis/kubernetes-analysis/tree/master/assets

# Building the Machine Learning Model

Now that we have an idea what the data set is about, we can start building a
first machine learning model. Before actually building the model, we have to
pre-process all the extracted release notes from the PRs. Otherwise, the model
would not be able to understand our input.

## Doing some first Natural Language Processing (NLP)

In the beginning, we have to define a vocabulary on which we want to train. I
decided to choose the [TfidfVectorizer][60] from the Python scikit-learn machine
learning library. This vectorizer is able to take our input texts and create a
single huge vocabulary out of them. This is our so-called [bag-of-words][61],
which has a chosen n-gram range of `(1, 2)` (unigrams and bigrams). Practically,
this means that we always use the first word and the next one as a single
vocabulary entry (bigram), and we also use every single word as a vocabulary
entry (unigram). The TfidfVectorizer is able to skip words that occur too
frequently (`max_df`), and it requires a minimum number of occurrences
(`min_df`) before adding a word to the vocabulary. I decided not to change those
values in the first place, simply because I had the intuition that release notes
are something unique to a project.
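
To make this concrete, here is a minimal scikit-learn sketch of such a
vectorizer setup, keeping the library defaults for `min_df` and `max_df`; the
sample texts are taken from the prediction examples later in this post:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

release_notes = [
    "Fix concurrent map access panic",
    "Fix NVML initialization race condition",
    "Switch core master base images from debian to distroless",
]

# Unigrams and bigrams, vectorized into a tf-idf weighted feature matrix.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
features = vectorizer.fit_transform(release_notes)

print(sorted(vectorizer.vocabulary_)[:5])
print(features.shape)  # (number of samples, vocabulary size)
```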

Parameters like `min_df`, `max_df` and the n-gram range can be seen as some of
our hyperparameters. Those parameters have to be optimized in a dedicated step
after the machine learning model has been built. This step is called
hyperparameter tuning, and it basically means that we train multiple times with
different parameters and compare the accuracy of the models. Afterwards, we
choose the parameters which yield the best accuracy.

[60]: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
[61]: https://en.wikipedia.org/wiki/Bag-of-words_model

During the training, the vectorizer will produce a `data/features.json` which
contains the whole vocabulary. This gives us a good understanding of what such a
vocabulary may look like:

```json
[
  …
  "hostname",
  "hostname address",
  "hostname and",
  "hostname as",
  "hostname being",
  "hostname bug",
  …
]
```

This produces roughly 50,000 entries in the overall bag-of-words, which is quite
a lot. Previous analyses between different data sets showed that it is simply
not necessary to take so many features into account. Some general data sets
state that an overall vocabulary of 20,000 is enough, and higher amounts do not
influence the accuracy any more. To trim the vocabulary, we can use the
[SelectKBest][62] feature selector, which only chooses the top features. Anyway,
I still decided to stick to the top 50,000 to not negatively influence the model
accuracy. We have a relatively low amount of data (approximately 7,000 samples)
and a low number of words per sample (~15), which already made me wonder if we
have enough data at all.

[62]: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html
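
For illustration, this is roughly how such a feature selection looks with
scikit-learn; the data here is synthetic, whereas the real code would select the
top features of the tf-idf matrix:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-ins: 100 samples with 500 tf-idf features, binary labels.
rng = np.random.default_rng(42)
features = rng.random((100, 500))
labels = rng.integers(0, 2, size=100)

# Keep only the k best-scoring features.
selector = SelectKBest(f_classif, k=50)
selected = selector.fit_transform(features, labels)
print(selected.shape)  # (100, 50)
```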

The vectorizer is not only able to create our bag-of-words; it can also encode
the features in [term frequency–inverse document frequency (tf-idf)][63] format.
That is where the vectorizer got its name, and the output of that encoding is
something the machine learning model can directly consume. All the details of
the vectorization process can be found in the [source code][64].

[63]: https://en.wikipedia.org/wiki/Tf%e2%80%93idf
[64]: https://github.com/kubernetes-analysis/kubernetes-analysis/blob/f419ff4a3462bafc0cb067aa6973dc7280409699/src/nlp.py#L193-L235

## Creating the Multi-Layer Perceptron (MLP) Model

I decided to choose a simple MLP-based model which is built with the help of the
popular [TensorFlow][70] framework. Because we do not have that much input data,
we just use two hidden layers, so that the model basically looks like this:

*[Diagram: the MLP model with two hidden layers]*

[70]: https://www.tensorflow.org/api_docs/python/tf/keras

There are [multiple other][71] hyperparameters to take into account when
creating the model. I will not discuss them in detail here, but they're
important to optimize, also in relation to the number of classes we want to have
in the model (only two in our case).

[71]: https://github.com/kubernetes-analysis/kubernetes-analysis/blob/f419ff4a3462bafc0cb067aa6973dc7280409699/src/nlp.py#L95-L100
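
To give an impression of what such a model can look like, here is a small
tf.keras sketch. The layer count, units and the sigmoid output mirror the
training log below, but it is an illustrative approximation rather than the
exact code from the repository:

```python
import tensorflow as tf


def build_model(input_dim: int) -> tf.keras.Model:
    """A two-hidden-layer MLP for binary classification on tf-idf vectors."""
    model = tf.keras.Sequential([
        tf.keras.layers.Dropout(0.2, input_shape=(input_dim,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dropout(0.2),
        # One output unit with sigmoid activation for the bug/non-bug decision.
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["acc"])
    return model


model = build_model(input_dim=51772)  # the vocabulary length from the training run
model.summary()
```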

## Training the Model

Before starting the actual training, we have to split up our input data into
training and validation data sets. I've chosen to use ~80% of the data for
training and 20% for validation purposes. We have to shuffle our input data as
well to ensure that the model is not affected by ordering issues. The technical
details of the training process can be found in the [GitHub sources][80]. So now
we're ready to finally start the training:

```
> ./main train
INFO | Using already extracted data from data/data.pickle
INFO | Loading pickle dataset
INFO | Parsed 34380 issues and 55832 pull requests (90212 items)
INFO | Training for label 'kind/bug'
INFO | 6980 items selected
INFO | Using 5584 training and 1395 testing texts
INFO | Number of classes: 2
INFO | Vocabulary len: 51772
INFO | Wrote features to file data/features.json
INFO | Using units: 1
INFO | Using activation function: sigmoid
INFO | Created model with 2 layers and 64 units
INFO | Compiling model
INFO | Starting training
Train on 5584 samples, validate on 1395 samples
Epoch 1/1000
5584/5584 - 3s - loss: 0.6895 - acc: 0.6789 - val_loss: 0.6856 - val_acc: 0.6860
Epoch 2/1000
5584/5584 - 2s - loss: 0.6822 - acc: 0.6827 - val_loss: 0.6782 - val_acc: 0.6860
Epoch 3/1000
…
Epoch 68/1000
5584/5584 - 2s - loss: 0.2587 - acc: 0.9257 - val_loss: 0.4847 - val_acc: 0.7728
INFO | Confusion matrix:
[[920  32]
 [291 152]]
INFO | Confusion matrix normalized:
[[0.966 0.034]
 [0.657 0.343]]
INFO | Saving model to file data/model.h5
INFO | Validation accuracy: 0.7727598547935486, loss: 0.48470408514836355
```

The output of the [Confusion Matrix][81] shows us that we're pretty good on
training accuracy, but the validation accuracy could be a bit higher. We could
now start a hyperparameter tuning to see if we can optimize the output of the
model even further. I will leave that experiment up to you, with a hint at the
`./main train --tune` flag.

We saved the model (`data/model.h5`), the vectorizer (`data/vectorizer.pickle`)
and the feature selector (`data/selector.pickle`) to disk to be able to use them
later on for prediction purposes without the need for additional training steps.

[80]: https://github.com/kubernetes-analysis/kubernetes-analysis/blob/f419ff4a3462bafc0cb067aa6973dc7280409699/src/nlp.py#L91-L170
[81]: https://en.wikipedia.org/wiki/Confusion_matrix
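
Since those artifacts are persisted, a later prediction run can load them
without re-training. A minimal sketch of that loading logic could look like
this, whereas the `predict` helper is made up for the example:

```python
import pickle

import tensorflow as tf

# Load the persisted artifacts from the training step.
model = tf.keras.models.load_model("data/model.h5")
with open("data/vectorizer.pickle", "rb") as f:
    vectorizer = pickle.load(f)
with open("data/selector.pickle", "rb") as f:
    selector = pickle.load(f)


def predict(text: str) -> float:
    """Return the probability that `text` describes a bug."""
    features = selector.transform(vectorizer.transform([text]))
    return float(model.predict(features.toarray())[0][0])


print(predict("Fix concurrent map access panic"))
```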

## A first Prediction

We are now able to test the model by loading it from disk and predicting some
input text:

```
> ./main predict --test
INFO | Testing positive text:

Fix concurrent map access panic
Don't watch .mount cgroups to reduce number of inotify watches
Fix NVML initialization race condition
Fix brtfs disk metrics when using a subdirectory of a subvolume

INFO | Got prediction result: 0.9940581321716309
INFO | Matched expected positive prediction result
INFO | Testing negative text:

action required
1. Currently, if users were to explicitly specify CacheSize of 0 for
   KMS provider, they would end-up with a provider that caches up to
   1000 keys. This PR changes this behavior.
   Post this PR, when users supply 0 for CacheSize this will result in
   a validation error.
2. CacheSize type was changed from int32 to *int32. This allows
   defaulting logic to differentiate between cases where users
   explicitly supplied 0 vs. not supplied any value.
3. KMS Provider's endpoint (path to Unix socket) is now validated when
   the EncryptionConfiguration files is loaded. This used to be handled
   by the GRPCService.

INFO | Got prediction result: 0.1251964420080185
INFO | Matched expected negative prediction result
```

Both tests are real-world examples which already exist. We could also try
something completely different, like this random tweet I found a couple of
minutes ago:

```
> ./main predict "My dudes, if you can understand SYN-ACK, you can understand consent"
INFO | Got prediction result: 0.1251964420080185
ERROR | Result is lower than selected threshold 0.6
```

Looks like it is not classified as a bug for a release note, so the approach
seems to work. Selecting a good threshold is also not that easy, but sticking to
something above 50% should be the bare minimum.
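
Expressed in code, applying such a threshold boils down to a simple comparison,
whereas the `0.6` mirrors the log output above:

```python
THRESHOLD = 0.6  # anything above 50% is the bare minimum


def is_bug(prediction: float) -> bool:
    """Classify a release note as kind/bug only above the chosen threshold."""
    return prediction > THRESHOLD


print(is_bug(0.9940581321716309))  # True
print(is_bug(0.1251964420080185))  # False
```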

# Automate Everything

The next step is to find some way of automation to continuously update the model
with new data. If I change any source code within my repository, then I'd like
to get feedback about the test results of the model without having to run the
training on my own machine. I would also like to utilize the GPUs in my
Kubernetes cluster to train faster, and to automatically update the data set
whenever a PR gets merged.

With the help of [Kubeflow pipelines][90], we can fulfill most of these
requirements. The pipeline I built looks like this:

[90]: https://www.kubeflow.org/docs/pipelines/overview/pipelines-overview

*[Diagram: the Kubeflow pipeline]*

First, we check out the source code of the PR, which will be passed on as an
output artifact to all other steps. Then we incrementally update the API and
internal data before we run the training on an always up-to-date data set. The
prediction test verifies after the training that we did not badly influence the
model with our changes.

We also build a container image within our pipeline. [This container image][91]
copies the previously built model, vectorizer, and selector into a container and
runs `./main serve`. When doing this, we spin up a [kfserving][92] web server,
which can be used for prediction purposes. Do you want to try it out yourself?
Simply send a JSON POST request like this to run the prediction against the
endpoint:

```
> curl https://kfserving.k8s.saschagrunert.de/v1/models/kubernetes-analysis:predict \
    -d '{"text": "my test text"}'
{"result": 0.1251964420080185}
```

The [custom kfserving][93] implementation is pretty straightforward, whereas the
deployment utilizes [Knative Serving][95] and an [Istio][94] ingress gateway
under the hood to correctly route the traffic into the cluster and provide the
right set of services.

[91]: https://github.com/kubernetes-analysis/kubernetes-analysis/blob/master/Dockerfile-deploy
[92]: https://www.kubeflow.org/docs/components/serving/kfserving
[93]: https://github.com/kubernetes-analysis/kubernetes-analysis/blob/master/src/kfserver.py
[94]: https://istio.io
[95]: https://knative.dev/docs/serving

The `commit-changes` and `rollout` steps will only run if the pipeline runs on
the `master` branch. Those steps make sure that we always have the latest data
set available on the master branch as well as in the kfserving deployment. The
[rollout step][96] creates a new canary deployment, which at first accepts only
50% of the incoming traffic. After the canary got deployed successfully, it will
be promoted as the new main instance of the service. This is a great way to
ensure that the deployment works as intended and allows additional testing after
rolling out the canary.

[96]: https://github.com/kubernetes-analysis/kubernetes-analysis/blob/f419ff4a3462bafc0cb067aa6973dc7280409699/src/rollout.py#L30-L51
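
To give an idea of what such a canary rollout can look like, here is a sketch
which patches the traffic split of a kfserving `InferenceService` via the
Kubernetes Python client. The resource name and namespace are made up, and the
real logic lives in the linked rollout step:

```python
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# Route 50% of the traffic to the canary; the field follows kfserving's
# v1alpha2 InferenceService schema.
api.patch_namespaced_custom_object(
    group="serving.kubeflow.org",
    version="v1alpha2",
    namespace="default",
    plural="inferenceservices",
    name="kubernetes-analysis",
    body={"spec": {"canaryTrafficPercent": 50}},
)
```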

But how do we trigger Kubeflow pipelines when creating a pull request? Kubeflow
has no feature for that right now. That's why I decided to use [Prow][100], the
Kubernetes test-infrastructure project, for CI/CD purposes.

First of all, a [24h periodic job][101] ensures that we have at least daily
up-to-date data available within the repository. Then, if we create a pull
request, Prow will run the whole Kubeflow pipeline without committing or rolling
out any changes. If we merge the pull request via Prow, another job runs on the
master branch and updates the data as well as the deployment. That's pretty
neat, isn't it?

[100]: https://github.com/kubernetes/test-infra/tree/master/prow
[101]: https://github.com/kubernetes-analysis/kubernetes-analysis/blob/f419ff4a3462bafc0cb067aa6973dc7280409699/ci/config.yaml#L45-L61

# Automatic Labeling of new PRs

The prediction API is nice for testing, but now we need a real-world use case.
Prow supports external plugins which can be used to take action on any GitHub
event. I wrote [a plugin][110] which uses the kfserving API to make predictions
based on new pull requests. This means that if we now create a new pull request
in the kubernetes-analysis repository, we will see the following:

[110]: https://github.com/kubernetes-analysis/kubernetes-analysis/tree/master/pkg

*[Screenshot: a newly created test pull request]*

---

*[Screenshot: the bot comments with its prediction for the release note]*

Okay, cool. Now let's change the release note based on a real bug from the
already existing dataset:

*[Screenshot: the release note gets edited to a real bug description]*

---

*[Screenshot: the bot updates its comment with a new prediction]*

The bot edits its own comment, predicts the release note with roughly 90% as
`kind/bug`, and automatically adds the correct label! Now, if we change it back
to some different, obviously wrong, release note:

*[Screenshot: the release note gets changed to an obviously wrong one]*

---

*[Screenshot: the bot reacts to the wrong release note]*

The bot does the work for us, removes the label, and informs us about what it
did! Finally, if we change the release note to `None`:

*[Screenshot: the release note gets changed to `None`]*

---

*[Screenshot: the bot removes its own comment]*

The bot removed the comment, which is nice and reduces the text noise on the PR.
Everything I demonstrated runs inside a single Kubernetes cluster, which makes
it unnecessary to expose the kfserving API to the public. This introduces an
indirect API rate limiting, because the only possible usage is via the Prow bot
user.

If you want to try it out for yourself, feel free to open a [new test
issue][111] in `kubernetes-analysis`. This works because I enabled the plugin
for issues as well as pull requests.

[111]: https://github.com/kubernetes-analysis/kubernetes-analysis/issues/new?&template=release-notes-test.md

So now we have a running CI bot which is able to classify new release notes
based on a machine learning model. If the bot ran in the official Kubernetes
repository, then we could correct wrong label predictions manually. This way,
the next training iteration would pick up the corrections and result in a
continuously improved model over time. All totally automated!

# Summary

Thank you for reading this far! This was my little data science journey through
the Kubernetes GitHub repository. There are a lot of other things to optimize,
for example introducing more classes (than just `kind/bug` or nothing) or
automatic hyperparameter tuning with Kubeflow's [Katib][120]. If you have any
questions or suggestions, feel free to get in touch with me anytime. See you
soon!

[120]: https://www.kubeflow.org/docs/components/hyperparameter-tuning/hyperparameter