--- title: "Minikube AI playground on Apple silicon" linkTitle: "Minikube AI playground on Apple silicon" weight: 1 date: 2024-10-04 --- This tutorial shows how to create an AI playground with minikube on Apple silicon devices such as a MacBook Pro. We'll create a cluster that shares your Mac's GPU using the krunkit driver, deploy two large language models, and interact with the models using Open WebUI. ![Open WebUI Chat](/images/open-webui-chat.png) ## Prerequisites - Apple silicon Mac - [krunkit](https://github.com/containers/krunkit) v1.0.0 or later - [vmnet-helper](https://github.com/nirs/vmnet-helper) v0.6.0 or later - [generic-device-plugin](https://github.com/squat/generic-device-plugin) - minikube v1.37.0 or later (krunkit driver only) ## Installing krunkit and vmnet-helper Install latest krunkit: ```shell brew tap slp/krunkit brew install krunkit krunkit --version ``` Instal latest vmnet-helper: ```shell curl -fsSL https://github.com/minikube-machine/vmnet-helper/releases/latest/download/install.sh | bash /opt/vmnet-helper/bin/vmnet-helper --version ``` For more information, see the [krunkit driver](https://minikube.sigs.k8s.io/docs/drivers/krunkit/) documentation. ## Download models Download some models to the local disk. By keeping the models outside of minikube, you can create and delete clusters quickly without downloading the models again. ```shell mkdir ~/models cd ~/models curl -LO 'https://huggingface.co/instructlab/granite-7b-lab-GGUF/resolve/main/granite-7b-lab-Q4_K_M.gguf?download=true' curl -LO 'https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q8_0.gguf?download=true' ``` **Important**: The model must be in *GGUF* format. ## Start minikube Start a minikube cluster with the krunkit driver, mounting the `~/models` directory at `/mnt/models`: ```shell minikube start --driver krunkit --mount-string ~/models:/mnt/models ``` Output: ``` 😄 minikube v1.37.0 on Darwin 15.6.1 (arm64) ✨ Using the krunkit (experimental) driver based on user configuration 👍 Starting "minikube" primary control-plane node in "minikube" cluster 🔥 Creating krunkit VM (CPUs=2, Memory=6144MB, Disk=20000MB) ... 🐳 Preparing Kubernetes v1.34.0 on Docker 28.4.0 ... 🔗 Configuring bridge CNI (Container Networking Interface) ... 🔎 Verifying Kubernetes components... ▪ Using image gcr.io/k8s-minikube/storage-provisioner:v5 🌟 Enabled addons: storage-provisioner, default-storageclass 🏄 Done! kubectl is now configured to use "minikube" cluster and "default" namespace by default ``` ### Verifying that the GPU is available The krunkit driver exposes your host GPU as a virtio-gpu device: ``` % minikube ssh -- tree /dev/dri /dev/dri |-- by-path | |-- platform-a007000.virtio_mmio-card -> ../card0 | `-- platform-a007000.virtio_mmio-render -> ../renderD128 |-- card0 `-- renderD128 ``` ## Deploying the generic-device-plugin To use the GPU in pods, we need the generic-device-plugin. 
## Deploying the generic-device-plugin

To use the GPU in pods, we need the generic-device-plugin. Deploy it with:

```shell
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: generic-device-plugin
  namespace: kube-system
  labels:
    app.kubernetes.io/name: generic-device-plugin
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: generic-device-plugin
  template:
    metadata:
      labels:
        app.kubernetes.io/name: generic-device-plugin
    spec:
      priorityClassName: system-node-critical
      tolerations:
      - operator: "Exists"
        effect: "NoExecute"
      - operator: "Exists"
        effect: "NoSchedule"
      containers:
      - image: squat/generic-device-plugin
        args:
        - --device
        - |
          name: dri
          groups:
            - count: 4
              paths:
                - path: /dev/dri
        name: generic-device-plugin
        resources:
          requests:
            cpu: 50m
            memory: 10Mi
          limits:
            cpu: 50m
            memory: 20Mi
        ports:
        - containerPort: 8080
          name: http
        securityContext:
          privileged: true
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
        - name: dev
          mountPath: /dev
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
      - name: dev
        hostPath:
          path: /dev
  updateStrategy:
    type: RollingUpdate
EOF
```

**Note**: This configuration allows up to 4 pods to use `/dev/dri`. You can increase `count` to run more pods using the GPU.

Wait until the generic-device-plugin DaemonSet is available:

```shell
% kubectl get daemonset generic-device-plugin -n kube-system -w
NAME                    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
generic-device-plugin   1         1         1       1            1           <none>          45s
```

## Deploying the granite model

To play with the granite model you downloaded, start a llama-server pod serving the model and a service to make the pod available to other pods.

```shell
cat <<'EOF' | kubectl apply -f -
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: granite
spec:
  replicas: 1
  selector:
    matchLabels:
      app: granite
  template:
    metadata:
      labels:
        app: granite
      name: granite
    spec:
      containers:
      - name: llama-server
        image: quay.io/ramalama/ramalama:latest
        command: [
          llama-server,
          --host, "0.0.0.0",
          --port, "8080",
          --model, /mnt/models/granite-7b-lab-Q4_K_M.gguf,
          --alias, "ibm/granite:7b",
          --ctx-size, "2048",
          --temp, "0.8",
          --cache-reuse, "256",
          -ngl, "999",
          --threads, "6",
          --no-warmup,
          --log-colors, auto,
        ]
        resources:
          limits:
            squat.ai/dri: 1
        volumeMounts:
        - name: models
          mountPath: /mnt/models
      volumes:
      - name: models
        hostPath:
          path: /mnt/models
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: granite
  name: granite
spec:
  ports:
  - protocol: TCP
    port: 8080
  selector:
    app: granite
EOF
```

Wait until the deployment is available:

```shell
% kubectl get deploy granite
NAME      READY   UP-TO-DATE   AVAILABLE   AGE
granite   1/1     1            1           8m17s
```

Check the granite service:

```shell
% kubectl get service granite
NAME      TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
granite   ClusterIP   10.105.145.9   <none>        8080/TCP   28m
```
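llama-server exposes an OpenAI-compatible API, so you can check that the model is actually being served before deploying a UI. The following one-off pod is a minimal sketch; it assumes the public `curlimages/curl` image and llama-server's `/v1/models` endpoint, but any image with `curl` will do:

```shell
kubectl run curl --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s http://granite:8080/v1/models
```

The response should include the model under the `ibm/granite:7b` alias configured with `--alias` above.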
## Deploying the tinyllama model

To play with the tinyllama model you downloaded, start a llama-server pod serving the model and a service to make the pod available to other pods.

```shell
cat <<'EOF' | kubectl apply -f -
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tinyllama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tinyllama
  template:
    metadata:
      labels:
        app: tinyllama
      name: tinyllama
    spec:
      containers:
      - name: llama-server
        image: quay.io/ramalama/ramalama:latest
        command: [
          llama-server,
          --host, "0.0.0.0",
          --port, "8080",
          --model, /mnt/models/tinyllama-1.1b-chat-v1.0.Q8_0.gguf,
          --alias, tinyllama,
          --ctx-size, "2048",
          --temp, "0.8",
          --cache-reuse, "256",
          -ngl, "999",
          --threads, "6",
          --no-warmup,
          --log-colors, auto,
        ]
        resources:
          limits:
            squat.ai/dri: 3
        volumeMounts:
        - name: models
          mountPath: /mnt/models
      volumes:
      - name: models
        hostPath:
          path: /mnt/models
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: tinyllama
  name: tinyllama
spec:
  ports:
  - protocol: TCP
    port: 8080
  selector:
    app: tinyllama
EOF
```

Wait until the deployment is available:

```shell
% kubectl get deploy tinyllama
NAME        READY   UP-TO-DATE   AVAILABLE   AGE
tinyllama   1/1     1            1           9m14s
```

Check the tinyllama service:

```shell
% kubectl get service tinyllama
NAME        TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
tinyllama   ClusterIP   10.98.219.117   <none>        8080/TCP   23m
```

## Deploying Open WebUI

The [Open WebUI](https://docs.openwebui.com) project provides an easy-to-use web interface for interacting with OpenAI-compatible APIs such as our llama-server pods.

To deploy Open WebUI, run:

```shell
cat <<'EOF' | kubectl apply -f -
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: open-webui
spec:
  replicas: 1
  selector:
    matchLabels:
      app: open-webui
  template:
    metadata:
      labels:
        app: open-webui
    spec:
      containers:
      - name: open-webui
        image: ghcr.io/open-webui/open-webui:dev-slim
        ports:
        - containerPort: 8080
        env:
        # Preconfigure OpenAI-compatible endpoints
        - name: OPENAI_API_BASE_URLS
          value: "http://granite:8080/v1;http://tinyllama:8080/v1"
        volumeMounts:
        - name: open-webui-data
          mountPath: /app/backend/data
      volumes:
      - name: open-webui-data
        persistentVolumeClaim:
          claimName: open-webui-data
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: open-webui-data
spec:
  storageClassName: standard
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Service
metadata:
  name: open-webui
spec:
  ports:
  - protocol: TCP
    port: 8080
    nodePort: 30080
  selector:
    app: open-webui
  type: NodePort
EOF
```

We configured our llama-server OpenAI-compatible API endpoints using the `OPENAI_API_BASE_URLS` environment variable. Check the [Open WebUI documentation](https://docs.openwebui.com) to learn how to configure endpoints using the Admin Panel instead.

Wait until the deployment is available:

```shell
% kubectl get deploy open-webui
NAME         READY   UP-TO-DATE   AVAILABLE   AGE
open-webui   1/1     1            1           69s
```

## Interacting with the models

Open the Open WebUI console in a browser:

```shell
open $(minikube service open-webui --url)
```

Create an Admin account to start using Open WebUI.

### Chatting with the granite model

You can start chatting with the "ibm/granite:7b" model. Type a prompt:

```
> Write a very technical haiku about playing with large language models with Minikube on Apple silicon

Mighty model, Minikube,
Silicon-powered speed,
Learning's dance, ever-changing.

Through data streams it weaves,
Inference's wisdom, vast and deep,
Apple's heartbeat, in code, resounds.

Exploring AI's vast frontier,
Minikube, language model's playground,
Innovation's rhythm, forever.
```
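Open WebUI is only a client for the same OpenAI-compatible endpoints, so you can also send a prompt straight to the granite service from the command line. This is a minimal sketch assuming llama-server's `/v1/chat/completions` endpoint and the `curlimages/curl` image:

```shell
kubectl run curl-chat --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s http://granite:8080/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"model": "ibm/granite:7b", "messages": [{"role": "user", "content": "Write a haiku about minikube."}]}'
```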
### Chatting with the tinyllama model

Click the "New Chat" button on the left and select the "tinyllama" model from the model menu in the top left. Type a prompt:

```
> How do you feel inside this fancy Minikube cluster?

I do not have a physical body. However, based on the given text material, the author is describing feeling inside a cluster of Minikube, a type of jellyfish. The use of the word "fancy" suggests that the author is impressed or appreciates the intricate design of the cluster, while the adjective "minikube" connotes its smooth texture, delicate shape, and iridescent colors. The word "cluster" suggests a group of these jellyfish, while "inside" implies being in the vicinity or enclosed within.
```
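## Cleaning up

When you are done experimenting, you can delete the cluster. Because the models live in `~/models` on the host, a new cluster created later can reuse them without downloading anything again:

```shell
minikube delete
```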