---
title: "Minikube AI playground on Apple silicon"
linkTitle: "Minikube AI playground on Apple silicon"
weight: 1
date: 2024-10-04
---

This tutorial shows how to create an AI playground with minikube on Apple
silicon devices such as a MacBook Pro. We'll create a cluster that shares your
Mac's GPU using the krunkit driver, deploy two large language models, and
interact with the models using Open WebUI.
## Prerequisites

- Apple silicon Mac
- [krunkit](https://github.com/containers/krunkit) v1.0.0 or later
- [vmnet-helper](https://github.com/nirs/vmnet-helper) v0.6.0 or later
- [generic-device-plugin](https://github.com/squat/generic-device-plugin)
- minikube v1.37.0 or later (krunkit driver only)
## Installing krunkit and vmnet-helper

Install the latest krunkit:

```shell
brew tap slp/krunkit
brew install krunkit
krunkit --version
```

Install the latest vmnet-helper:

```shell
curl -fsSL https://github.com/minikube-machine/vmnet-helper/releases/latest/download/install.sh | bash
/opt/vmnet-helper/bin/vmnet-helper --version
```

For more information, see the [krunkit driver](https://minikube.sigs.k8s.io/docs/drivers/krunkit/)
documentation.
## Download models

Download some models to the local disk. By keeping the models outside of
minikube, you can create and delete clusters quickly without downloading the
models again.

```shell
mkdir ~/models
cd ~/models
curl -LO 'https://huggingface.co/instructlab/granite-7b-lab-GGUF/resolve/main/granite-7b-lab-Q4_K_M.gguf?download=true'
curl -LO 'https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q8_0.gguf?download=true'
```

**Important**: The model must be in *GGUF* format.
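GGUF files begin with the 4-byte ASCII magic `GGUF`, so you can sanity-check a download before using it. A minimal sketch (`check_gguf` is a hypothetical helper; adjust the path to the model you downloaded):

```shell
# GGUF files carry the ASCII magic "GGUF" in their first 4 bytes.
check_gguf() {
    if [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]; then
        echo "$1: looks like a GGUF file"
    else
        echo "$1: NOT a GGUF file"
    fi
}
check_gguf ~/models/granite-7b-lab-Q4_K_M.gguf
```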
## Start minikube

Start a minikube cluster with the krunkit driver, mounting the `~/models`
directory at `/mnt/models`:

```shell
minikube start --driver krunkit --mount-string ~/models:/mnt/models
```

Output:
```
😄 minikube v1.37.0 on Darwin 15.6.1 (arm64)
✨ Using the krunkit (experimental) driver based on user configuration
👍 Starting "minikube" primary control-plane node in "minikube" cluster
🔥 Creating krunkit VM (CPUs=2, Memory=6144MB, Disk=20000MB) ...
🐳 Preparing Kubernetes v1.34.0 on Docker 28.4.0 ...
🔗 Configuring bridge CNI (Container Networking Interface) ...
🔎 Verifying Kubernetes components...
    ▪ Using image gcr.io/k8s-minikube/storage-provisioner:v5
🌟 Enabled addons: storage-provisioner, default-storageclass
🏄 Done! kubectl is now configured to use "minikube" cluster and "default" namespace by default
```
### Verifying that the GPU is available

The krunkit driver exposes your host GPU as a virtio-gpu device:

```
% minikube ssh -- tree /dev/dri
/dev/dri
|-- by-path
|   |-- platform-a007000.virtio_mmio-card -> ../card0
|   `-- platform-a007000.virtio_mmio-render -> ../renderD128
|-- card0
`-- renderD128
```
## Deploying the generic-device-plugin

To use the GPU in pods, we need the generic-device-plugin. Deploy it with:

```shell
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: generic-device-plugin
  namespace: kube-system
  labels:
    app.kubernetes.io/name: generic-device-plugin
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: generic-device-plugin
  template:
    metadata:
      labels:
        app.kubernetes.io/name: generic-device-plugin
    spec:
      priorityClassName: system-node-critical
      tolerations:
      - operator: "Exists"
        effect: "NoExecute"
      - operator: "Exists"
        effect: "NoSchedule"
      containers:
      - image: squat/generic-device-plugin
        args:
        - --device
        - |
          name: dri
          groups:
            - count: 4
              paths:
                - path: /dev/dri
        name: generic-device-plugin
        resources:
          requests:
            cpu: 50m
            memory: 10Mi
          limits:
            cpu: 50m
            memory: 20Mi
        ports:
        - containerPort: 8080
          name: http
        securityContext:
          privileged: true
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
        - name: dev
          mountPath: /dev
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
      - name: dev
        hostPath:
          path: /dev
  updateStrategy:
    type: RollingUpdate
EOF
```
**Note**: This configuration allows up to 4 pods to use `/dev/dri`. You can
increase `count` to run more pods using the GPU.

Wait until the generic-device-plugin DaemonSet is available:

```shell
% kubectl get daemonset generic-device-plugin -n kube-system -w
NAME                    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
generic-device-plugin   1         1         1       1            1           <none>          45s
```
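You can also confirm that the node now advertises the `squat.ai/dri` resource (the name comes from the `dri` device group configured above, and the value should match `count: 4`). A quick sketch; `gpu_slots` is a hypothetical helper:

```shell
# Hypothetical helper: print how many squat.ai/dri slots a node advertises.
# Dots in the resource name must be escaped in the jsonpath expression.
gpu_slots() {
    kubectl get node "$1" -o jsonpath='{.status.allocatable.squat\.ai/dri}'
}
echo "allocatable squat.ai/dri slots: $(gpu_slots minikube)"
```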
## Deploying the granite model

To play with the granite model you downloaded, start a llama-server pod serving
the model and a service to make the pod available to other pods.

```shell
cat <<'EOF' | kubectl apply -f -
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: granite
spec:
  replicas: 1
  selector:
    matchLabels:
      app: granite
  template:
    metadata:
      labels:
        app: granite
      name: granite
    spec:
      containers:
      - name: llama-server
        image: quay.io/ramalama/ramalama:latest
        command: [
          llama-server,
          --host, "0.0.0.0",
          --port, "8080",
          --model, /mnt/models/granite-7b-lab-Q4_K_M.gguf,
          --alias, "ibm/granite:7b",
          --ctx-size, "2048",
          --temp, "0.8",
          --cache-reuse, "256",
          -ngl, "999",
          --threads, "6",
          --no-warmup,
          --log-colors, auto,
        ]
        resources:
          limits:
            squat.ai/dri: 1
        volumeMounts:
        - name: models
          mountPath: /mnt/models
      volumes:
      - name: models
        hostPath:
          path: /mnt/models
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: granite
  name: granite
spec:
  ports:
  - protocol: TCP
    port: 8080
  selector:
    app: granite
EOF
```

Wait until the deployment is available:

```shell
% kubectl get deploy granite
NAME      READY   UP-TO-DATE   AVAILABLE   AGE
granite   1/1     1            1           8m17s
```

Check the granite service:

```shell
% kubectl get service granite
NAME      TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
granite   ClusterIP   10.105.145.9   <none>        8080/TCP   28m
```
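llama-server exposes an OpenAI-compatible API, so you can smoke-test the service from inside the cluster before wiring up a UI. A sketch using a temporary curl pod (`api-test` is an arbitrary pod name):

```shell
# List the models served by the granite service via its OpenAI-compatible API.
kubectl run api-test --rm -i --restart=Never --image=curlimages/curl -- \
    curl -s http://granite:8080/v1/models
```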
## Deploying the tinyllama model

To play with the tinyllama model you downloaded, start a llama-server pod
serving the model and a service to make the pod available to other pods.

```shell
cat <<'EOF' | kubectl apply -f -
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tinyllama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tinyllama
  template:
    metadata:
      labels:
        app: tinyllama
      name: tinyllama
    spec:
      containers:
      - name: llama-server
        image: quay.io/ramalama/ramalama:latest
        command: [
          llama-server,
          --host, "0.0.0.0",
          --port, "8080",
          --model, /mnt/models/tinyllama-1.1b-chat-v1.0.Q8_0.gguf,
          --alias, tinyllama,
          --ctx-size, "2048",
          --temp, "0.8",
          --cache-reuse, "256",
          -ngl, "999",
          --threads, "6",
          --no-warmup,
          --log-colors, auto,
        ]
        resources:
          limits:
            squat.ai/dri: 3
        volumeMounts:
        - name: models
          mountPath: /mnt/models
      volumes:
      - name: models
        hostPath:
          path: /mnt/models
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: tinyllama
  name: tinyllama
spec:
  ports:
  - protocol: TCP
    port: 8080
  selector:
    app: tinyllama
EOF
```

Wait until the deployment is available:

```shell
% kubectl get deploy tinyllama
NAME        READY   UP-TO-DATE   AVAILABLE   AGE
tinyllama   1/1     1            1           9m14s
```

Check the tinyllama service:

```shell
% kubectl get service tinyllama
NAME        TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
tinyllama   ClusterIP   10.98.219.117   <none>        8080/TCP   23m
```
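You can also send a chat request directly to the tinyllama service using the standard OpenAI chat completions endpoint, the same API Open WebUI will use below. A sketch with a temporary curl pod (`chat-test` is an arbitrary pod name):

```shell
# Send one chat request to the tinyllama service (OpenAI chat completions API).
kubectl run chat-test --rm -i --restart=Never --image=curlimages/curl -- \
    curl -s http://tinyllama:8080/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"model":"tinyllama","messages":[{"role":"user","content":"Say hello in one sentence."}]}'
```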
## Deploying Open WebUI

The [Open WebUI](https://docs.openwebui.com) project provides an easy-to-use web
interface for interacting with OpenAI-compatible APIs such as our llama-server
pods.

To deploy Open WebUI, run:

```shell
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: open-webui
spec:
  replicas: 1
  selector:
    matchLabels:
      app: open-webui
  template:
    metadata:
      labels:
        app: open-webui
    spec:
      containers:
      - name: open-webui
        image: ghcr.io/open-webui/open-webui:dev-slim
        ports:
        - containerPort: 8080
        env:
        # Preconfigure OpenAI-compatible endpoints
        - name: OPENAI_API_BASE_URLS
          value: "http://granite:8080/v1;http://tinyllama:8080/v1"
        volumeMounts:
        - name: open-webui-data
          mountPath: /app/backend/data
      volumes:
      - name: open-webui-data
        persistentVolumeClaim:
          claimName: open-webui-data
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: open-webui-data
spec:
  storageClassName: standard
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Service
metadata:
  name: open-webui
spec:
  ports:
  - protocol: TCP
    port: 8080
    nodePort: 30080
  selector:
    app: open-webui
  type: NodePort
EOF
```
We configured our llama-server OpenAI-compatible API endpoints using the
`OPENAI_API_BASE_URLS` environment variable.
Check the [Open WebUI documentation](https://docs.openwebui.com) to learn how to
configure it using the Admin Panel.

Wait until the deployment is available:

```shell
% kubectl get deploy open-webui
NAME         READY   UP-TO-DATE   AVAILABLE   AGE
open-webui   1/1     1            1           69s
```
## Interacting with the models

Open the Open WebUI console in a browser:

```shell
open $(minikube service open-webui --url)
```
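If you prefer a fixed local port over the NodePort URL, `kubectl port-forward` also works (a sketch; the command runs in the foreground, so stop it with Ctrl-C when you are done):

```shell
# Forward local port 8080 to the open-webui service, then browse to
# http://localhost:8080
kubectl port-forward service/open-webui 8080:8080
```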
Create an Admin account to start using Open WebUI.

### Chatting with the granite model

You can start chatting with the "ibm/granite:7b" model.

Type a prompt:

```
> Write a very technical haiku about playing with large language models with Minikube on Apple silicon

Mighty model, Minikube,
Silicon-powered speed,
Learning's dance, ever-changing.
Through data streams it weaves,
Inference's wisdom, vast and deep,
Apple's heartbeat, in code, resounds.
Exploring AI's vast frontier,
Minikube, language model's playground,
Innovation's rhythm, forever.
```
### Chatting with the tinyllama model

Click the "New Chat" button on the left and select the "tinyllama" model from
the model menu in the top left.

Type a prompt:

```
> How do you feel inside this fancy Minikube cluster?

I do not have a physical body. However, based on the given text material, the
author is describing feeling inside a cluster of Minikube, a type of jellyfish.
The use of the word "fancy" suggests that the author is impressed or appreciates
the intricate design of the cluster, while the adjective "minikube" connotes its
smooth texture, delicate shape, and iridescent colors. The word "cluster"
suggests a group of these jellyfish, while "inside" implies being in the
vicinity or enclosed within.
```