* vfkit: Remove temporary variable
Remove temporary and unneeded mac variable. It is easier to follow the
code when we use d.MACAddress.
* vfkit: Promote state change to INFO level
System state changes should be more visible to make debugging easier.
* vmnet: Add vmnet package
The package manages the vmnet-helper[1] child process, providing
connection to the vmnet network without running the guest as root.
We will use vmnet-helper for the vfkit driver, which does not have a way
to use shared network, when guests can access other guest in the
network. We can use it later with the qemu driver as alternative to
socket_vmnet.
[1] https://github.com/nirs/vmnet-helper
* vfkit: add vmnet-shared network
Add new network option for vfkit "vmnet-shared", connecting vfkit to the
vmnet shared network. Clusters using this network can access other
clusters in the same network, similar to socket_vmnet with QEMU driver.
If network is not specified, we default to the "nat" network, keeping
the previous behavior. If network is "vmnet-shared", the vfkit driver
manages 2 processes: vfkit and vmnet-helper.
Like vfkit, vmnet-helper is started in the background, in a new process
group, so it not terminated if the minikube process group is terminate.
Since vmnet-helper requires root to start the vmnet interface, we start
it with sudo, creating 2 child processes. vmnet-helper drops privileges
immediately after starting the vmnet interface, and run as the user and
group running minikube.
Stopping the cluster will stop sudo, which will stop the vmnet-helper
process. Deleting the cluster kill both sudo and vmnet-helper by killing
the process group.
This change is not complete, but it is good enough to play with the new
shared network.
Example usage:
1. Install vmnet-helper:
https://github.com/nirs/vmnet-helper?tab=readme-ov-file#installation
2. Setup vmnet-helper sudoers rule:
https://github.com/nirs/vmnet-helper?tab=readme-ov-file#granting-permission-to-run-vmnet-helper
3. Start 2 clusters with vmnet-shared network:
% minikube start -p c1 --driver vfkit --network vmnet-shared
...
% minikube start -p c2 --driver vfkit --network vmnet-shared
...
% minikube ip -p c1
192.168.105.18
% minikube ip -p c2
192.168.105.19
4. Both cluster can access the other cluster:
% minikube -p c1 ssh -- ping -c 3 192.168.105.19
PING 192.168.105.19 (192.168.105.19): 56 data bytes
64 bytes from 192.168.105.19: seq=0 ttl=64 time=0.621 ms
64 bytes from 192.168.105.19: seq=1 ttl=64 time=0.989 ms
64 bytes from 192.168.105.19: seq=2 ttl=64 time=0.490 ms
--- 192.168.105.19 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.490/0.700/0.989 ms
% minikube -p c2 ssh -- ping -c 3 192.168.105.18
PING 192.168.105.18 (192.168.105.18): 56 data bytes
64 bytes from 192.168.105.18: seq=0 ttl=64 time=0.289 ms
64 bytes from 192.168.105.18: seq=1 ttl=64 time=0.798 ms
64 bytes from 192.168.105.18: seq=2 ttl=64 time=0.993 ms
--- 192.168.105.18 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.289/0.693/0.993 ms
* reason: Remove trailing whitepsace
Trailing whitespace is removed by some editors or displayed as a
warning. Clean up to make it easy to make maintain this file.
* start: Validate vfkit --network option
The vfkit driver supports now `nat` and `vmnet-shared` network options.
The `nat` option provides the best performance and is always available,
so it is the default network option. The `vmnet-shared` option provides
access between machines with lower performance compared to `nat`.
If `vment-shared` option is selected, we verify that vmnet-helper is
available. The check ensure that vmnet-helper is installed and sudoers
configuration allows the current user to run vment-helper without a
password.
If validating vment-helper failed, we return a new NotFoundVmnetHelper
reason pointing to vment-helper installation docs or recommending to use
`nat`. This is based on how we treat missing socket_vmnet for QEMU
driver.
* site: Document vfkit network options
* Fix waiting for all kube-system pods having one of specified labels to be Ready
* Fix waiting for all kube-system pods having one of specified labels to be Ready
* Fix waiting for all kube-system pods having one of specified labels to be Ready
* Add process package
Current code contains multiple implementations for managing a process
using pids, with various issues:
- Some are unsafe, terminating a process by pid without validating that
the pid belongs to the right process. Some use unclear
- Using unclear terms like checkPid() (what does it mean?)
- Some are missing tests
Let's clean up the mess by introducing a process package. The package
provides:
- process.WritePidfile(): write a pid to file
- process.ReadPidfile(): read pid from file
- process.Exists(): tells if process matching pid and name exists
- process.Terminate() terminates a process matching pid and name
- process.Kil() kill a process matching pid and name
The library is tested on linux, darwin, and windows. On windows we don't
have a standard way to terminate a process gracefully, so
process.Terminate() is the same as process.Kill().
I want to use this package in vfkit and the new vment package, and later
we can use it for qemu, hyperkit, and other code using managing
processes with pids.
* vfkit: Use process pidfile helpers
- Simplify GetState() using process.ReadPidfile()
- Simplify Start() using process.WritePidfile()
* vfkit: Simpler and more robust GetState()
GetState() had several issues:
- When accessing vfkit HTTP API, we handled only "running",
"VirtualMachineStateRunning", "stopped", and
"VirtualMachineStateStopped", but there are other 10 possible states,
which we handled as state.None, when vfkit is running and need to be
stopped. This can lead to wrong handling in the caller.
- When handling "stopped" and "VirtualMachineStateStopped" we returned
state.Stopped, but did not remove the pidfile. This can cause
termination of unrelated process or reporting wrong status when the
pid is reused.
- Accessing the HTTP API will fail after we stop or kill it. This
cause GetState() to fail when the process is actually stopped, which
can lead to unnecessary retries and long delays (#20503).
- When retuning state.None during Remove(), we use tried to do graceful
shutdown which does not make sense in minikube delete flow, and is not
consistent with state.Running handling.
Accessing vfkit API to check for state does not add much value for our
used case, checking if the vfkit process is running, and it is not
reliable.
Fix all the issues by not using the HTTP API in GetState(), and use only
the process state. We still use the API for stopping and killing vfkit
to do graceful shutdown. This also simplifies Remove(), since we need to
handle only the state.Running state.
With this change we consider vfkit as stopped only when the process does
not exist, which takes about 3 seconds after the state is reported as
"stopped".
Example stop flow:
I0309 18:15:40.260249 18857 main.go:141] libmachine: Stopping "minikube"...
I0309 18:15:40.263225 18857 main.go:141] libmachine: set state: {State:Stop}
I0309 18:15:46.266902 18857 main.go:141] libmachine: Machine "minikube" was stopped.
I0309 18:15:46.267122 18857 stop.go:75] duration metric: took 6.127761459s to stop
Example delete flow:
I0309 17:00:49.483078 18127 out.go:177] * Deleting "minikube" in vfkit ...
I0309 17:00:49.499252 18127 main.go:141] libmachine: set state: {State:HardStop}
I0309 17:00:49.569938 18127 lock.go:35] WriteFile acquiring /Users/nir/.kube/config: ...
I0309 17:00:49.573977 18127 out.go:177] * Removed all traces of the "minikube" cluster.
* vfkit: Use process.Exists()
Previously we did not check the process name when checking a pid from a
pidfile. If the pidfile became state we would assume that vfkit is
running and try to stop it via the HTTP API, which would never succeed.
Now we detect stale pidfile and remove it.
If removing the stale pidfile fails, we don't want to fail the operation
since we know that vfkit is not running. We log the failure to aid
debugging of stale pidfile.
* vfkit: More robust Stop()
If setting vfkit state to "Stop" fails, we used to return an error.
Retrying the operation may never succeed.
Fix by falling back to terminating vfkit using a signal. This terminates
vfkit immediately similar to HardStop[1].
We can still fail if the pidfile is corrupted but this is unlikely and
requires manual cleanup.
In the case when we are sure the vfkit process does not exist, we remove
the pidfile immediately, avoiding leftover pidfile if the caller does
not call GetState() after Stop().
[1] https://github.com/crc-org/vfkit/issues/284
* vfkit: More robust Kill()
We know that setting the state to `HardStop` typically fails:
I0309 19:19:42.378591 21795 out.go:177] 🔥 Deleting "minikube" in vfkit ...
W0309 19:19:42.397472 21795 delete.go:106] remove failed, will retry: kill: Post "http://_/vm/state": EOF
This may lead to unnecessary retries and delays. Fix by falling back to
sending a SIGKILL signal.
Example delete flow when setting vfkit state fails:
I0309 20:07:41.688259 25540 out.go:177] 🔥 Deleting "minikube" in vfkit ...
I0309 20:07:41.712017 25540 main.go:141] libmachine: Failed to set vfkit state to 'HardStop': Post "http://_/vm/state": EOF