Commit Graph

435 Commits (release-1.28)

Author SHA1 Message Date
Brad Davidson 2d0661e3a5 Fix issue with loadbalancer failover to default server
The loadbalancer should only fail over to the default server if all other server have failed, and it should force fail-back to a preferred server as soon as one passes health checks.

The loadbalancer tests have been improved to ensure that this occurs.

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-11-14 09:41:51 -08:00
Brad Davidson ac44d6a5cf Add nonroot-devices flag to agent CLI
Add new flag that is passed through to the device_ownership_from_security_context parameter in the containerd CRI config. This is not possible to change without providing a complete custom containerd.toml template so we should add a flag for it.

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
(cherry picked from commit 56fb3b0991)
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-11-07 13:11:46 -08:00
manuelbuil 4c2437273a Add the nvidia runtime cdi
Signed-off-by: manuelbuil <mbuil@suse.com>
2024-10-12 07:39:12 +02:00
Brad Davidson 8ae29087c2 Update tcpproxy for import path change
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
(cherry picked from commit 1ae9ca73f5)
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-10-10 11:41:08 -07:00
Vitor Savian eac26ef041 Add user path to runtimes search
Signed-off-by: Vitor Savian <vitor.savian@suse.com>
2024-10-08 13:19:39 -03:00
Brad Davidson d5b7bac79f Fix hosts.toml header var
Resolves issue from 270f85e468 that prevented old hosts.toml files from being cleaned up.

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-09-10 15:00:29 -07:00
Brad Davidson bdebbd3703 Only clean up containerd hosts dirs managed by k3s
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
(cherry picked from commit 270f85e468)
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-09-06 11:30:49 -07:00
Derek Nola 47f1d34d0b Allow Pprof and Superisor metrics in standalone mode (#10576)
* Allow pprof to run on server with `--disable-agent`
* Allow supervisor metrics to run on server with `--disable-agent`

Signed-off-by: Derek Nola <derek.nola@suse.com>
2024-08-06 08:51:24 -07:00
Brad Davidson 199d9eae8a
[release-1.28] Backports for 2024-08 release cycle (#10666)
* Use pagination when retrieving etcd snapshot list

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
(cherry picked from commit c2216a62ad)
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>

* Update secretsencrypt pagination

Make secretsencrypt page size and iteration consistent with other paginators

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
(cherry picked from commit 891e72f90f)
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>

* Cap length of generated name used for servicelb daemonset

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
(cherry picked from commit 21611c5665)
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>

* Fix ipv6 sysctl required by non-ipv6 LoadBalancer service

This is a partial revert of 095ecdb034,
with the workaround moved into klipper-lb.

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
(cherry picked from commit d4c3422a85)
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>

* remove deprecated use of wait functions

Signed-off-by: Will <will7989@hotmail.com>
(cherry picked from commit e4f3cc7b54)
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>

* Update pkg/cluster/managed.go

Co-authored-by: Derek Nola <derek.nola@suse.com>
Signed-off-by: Will Andrews <will7989@hotmail.com>
(cherry picked from commit e2179aa957)
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>

* Wire lasso metrics up to common gatherer

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
(cherry picked from commit e168438d44)
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>

* Fix cloudprovider controller name

Looking at metrics revealed the cloudprovider controller name was anempty string.

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
(cherry picked from commit bffdf463e1)
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>

---------

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Signed-off-by: Will <will7989@hotmail.com>
Signed-off-by: Will Andrews <will7989@hotmail.com>
Co-authored-by: Will <will7989@hotmail.com>
Co-authored-by: Derek Nola <derek.nola@suse.com>
2024-08-05 09:35:31 -07:00
Brad Davidson e1ec61ff43 Add dial duration to debug error message
This should give us more detail on how long dials take before failing, so that we can perhaps better tune the retry loop in the future.

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-07-15 10:14:40 -07:00
Brad Davidson 69775ef6cc Fix IPv6 primary node-ip handling
I should have caught `[]string{cfg.NodeIP}[0]` and `[]string{envInfo.NodeIP.String()}[0]` in code review...

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-07-15 10:14:40 -07:00
Brad Davidson af6df48158 Fix agents removing configured supervisor address
We shouldn't be replacing the configured server address on agents. Doing
so breaks the agent's ability to fall back to the fixed registration
endpoint when all servers are down, since we replaced it with the first
discovered apiserver address. The fixed registration endpoint will be
restored as default when the service is restarted, but this is not the
correct behavior. This should have only been done on etcd-only nodes
that start up using their local supervisor, but need to switch to a
control-plane node as soon as one is available.

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-07-15 10:14:40 -07:00
Brad Davidson 662f31ed92 Fix reentrant rlock in loadbalancer.dialContext
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-07-15 10:14:40 -07:00
Brad Davidson 8c67a3091e Ensure remotedialer kubelet connections use kubelet bind address
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
(cherry picked from commit eb8bd15889)
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-07-15 10:14:40 -07:00
Roberto Bonafiglia d076d9a78c Update flannel to v0.25.4 and fixed issue with IPv6 mask
Signed-off-by: Roberto Bonafiglia <roberto.bonafiglia@suse.com>
2024-07-01 18:59:01 +02:00
Brad Davidson dd86a0581f Fix agent supervisor port using apiserver port instead
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-06-13 15:39:36 -07:00
Harrison Affel b261f298eb fix typo, use rancher/permissions
Signed-off-by: Harrison Affel <harrisonaffel@gmail.com>
2024-06-07 08:31:31 -07:00
Brad Davidson 5773a34447 Fix race condition panic in loadbalancer.nextServer
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-06-07 07:40:28 -07:00
fmoral2 9d5becb2b9
Add test for `isValidResolvConf` (#10302) (#10331)
Signed-off-by: Francisco <francisco.moral@suse.com>
2024-06-07 11:06:52 -03:00
Brad Davidson 7de7adb2e4 Fix bug that caused agents to bypass local loadbalancer
If proxy.SetAPIServerPort was called multiple times, all calls after the
first one would cause the apiserver address to be set to the default
server address, bypassing the local load-balancer. This was most likely
to occur on RKE2, where the supervisor may be up for a period of time
before it is ready to manage node password secrets, causing the agent
to retry.

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
(cherry picked from commit 1661f1024a)
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-06-04 13:21:47 -07:00
Brad Davidson 0e1c565c97 Fix embedded mirror blocked by SAR RBAC and re-enable test
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-05-31 09:17:13 -07:00
Brad Davidson f2267fed40 Fix issue caused by sole server marked as failed under load
If health checks are failing for all servers, make a second pass through the server list with health-checks ignored before returning failure

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
(cherry picked from commit ca39614d4e)
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-05-31 09:17:13 -07:00
Brad Davidson a8d082b51c Fix netpol crash when node remains tained unintialized
It is concievable that users might take more than 60 seconds to deploy their own cloud-provider. Instead of exiting, we should wait forever, but with more logging to indicate what's being waited on.

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
(cherry picked from commit ed23a2bb48)
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-05-31 09:17:13 -07:00
Brad Davidson 7ace0d9142 Refactor supervisor listener startup and add metrics
* Refactor agent supervisor listener startup and authn/authz to use upstream
  auth delegators to perform for SubjectAccessReview for access to
  metrics.
* Convert spegel and pprof handlers over to new structure.
* Promote bind-address to agent flag to allow setting supervisor bind
  address for both agent and server.
* Promote enable-pprof to agent flag to allow profiling agents. Access
  to the pprof endpoint now requires client cert auth, similar to the
  spegel registry api endpoint.
* Add prometheus metrics handler.

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
(cherry picked from commit ff679fb3ab)
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-05-31 09:17:13 -07:00
linxin 332761c387 Validate resolv.conf for presence of nameserver entries
Co-authored-by: Brad Davidson <brad@oatmail.org>
Signed-off-by: linxin <linxin@geedgenetworks.com>
(cherry picked from commit f24ba9d3a9)
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-05-31 09:17:13 -07:00
Brad Davidson 454a478969 Switch stargz over to cri registry config_path
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
(cherry picked from commit 30999f9a07)
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-05-31 09:17:13 -07:00
Brad Davidson 2cc7766dcf Use fixed stream server bind address for cri-dockerd
Will now use 127.0.0.1:10010, same as containerd's CRI

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
(cherry picked from commit 7374010c0c)
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-05-31 09:17:13 -07:00
Brad Davidson 27ed5e3b8f Add WithSkipMissing to not fail import on missing blobs
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
(cherry picked from commit 5f6b813cc8)
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-05-31 09:17:13 -07:00
Thomas Ferrandiz 695192e5ff Use TrafficManager interface when calling flannel
Signed-off-by: Thomas Ferrandiz <thomas.ferrandiz@suse.com>
2024-05-28 07:39:03 +00:00
Thomas Ferrandiz a7cc980e15 Bump flannel version to v0.25.2
Signed-off-by: Thomas Ferrandiz <thomas.ferrandiz@suse.com>
2024-05-28 07:37:58 +00:00
Harrison Affel 1c33ee6859 windows changes
Signed-off-by: Harrison Affel <harrisonaffel@gmail.com>
2024-05-16 17:31:14 -07:00
Brad Davidson 870030cc9a Add workaround for containerd hosts.toml bug
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
(cherry picked from commit f2961fb5d2)
2024-04-11 10:01:16 -07:00
Brad Davidson d5a85d7307 Add certificate expiry check and warnings
* Add ADR
* Add `k3s certificate check` command.
* Add periodic check and events when certs are about to expire.
* Add metrics for certificate validity remaining, labeled by cert subject

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
(cherry picked from commit 7f659759dd)
2024-04-11 10:01:16 -07:00
Brad Davidson f8e4828963 Add health-check support to loadbalancer
* Adds support for health-checking loadbalancer servers. If a
  health-check fails when dialing, all existing connections to the
  server will be closed.
* Wires up a remotedialer tunnel connectivity check as the health check
  for supervisor/apiserver connections.
* Wires up a simple ping request to the supervisor port as the health
  check for etcd connections.

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
(cherry picked from commit c51d7bfbd1)
2024-04-11 10:01:16 -07:00
Brad Davidson 6d2e9314b1 Fix error when image has already been pulled
CRI and containerd APIs disagree about the registry names - CRI supports
index.docker.io as an alias for docker.io, while containerd does not.
Use the actual stored RepoTag to determine what image to ask containerd for.

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
(cherry picked from commit f099bfa508)
2024-04-11 10:01:16 -07:00
Derek Nola 993f67738b
Transition from deprecated pointer library to ptr (#9801) (#9824)
Signed-off-by: Derek Nola <derek.nola@suse.com>
2024-03-30 21:30:10 -07:00
Derek Nola 1cde5b83ce
Remove old pinned dependencies (#9827)
Signed-off-by: Derek Nola <derek.nola@suse.com>
2024-03-30 21:29:45 -07:00
Brad Davidson aa3a18ba9b Fix wildcard entry upstream fallback
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-03-12 23:31:33 -07:00
Brad Davidson c15e17676d Warn and suppress duplicate registry mirror endpoints
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-03-07 16:36:56 -08:00
Vitor Savian 8202e9305e Fix wildcard with embbeded registry test
Signed-off-by: Vitor Savian <vitor.savian@suse.com>
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
(cherry picked from commit 59c724f7a6)
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-03-07 16:36:56 -08:00
Flavio Castelli 0d777dcb2f fix: use correct wasm shims names
Fix the wasm shim detection and the containerd configuration generation.

Prior to this commit, the binary and the `RuntimeType` values were not
correct.

Signed-off-by: Flavio Castelli <fcastelli@suse.com>
(cherry picked from commit 64e4f0e6e7)
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-03-07 16:36:56 -08:00
Brad Davidson 357be4aa02 Fix additional corner cases in registries handling
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
(cherry picked from commit b164d7a270)
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-03-07 16:36:56 -08:00
Brad Davidson 853473c180 Tweak netpol node wait logs
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
(cherry picked from commit 513c3416e7)
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-03-07 16:36:56 -08:00
Brad Davidson 5ff3108ef1 Fix NodeHosts on dual-stack clusters
* Add both dual-stack addresses to the node hosts file
* Add hostname to hosts file as alias for node name to ensure consistent resolution

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
(cherry picked from commit be569f65a9)
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-03-07 16:36:56 -08:00
Roberto Bonafiglia 1f44f83627 Adjust first node-ip based on configured clusterCIDR
Signed-off-by: Roberto Bonafiglia <roberto.bonafiglia@suse.com>
2024-03-06 11:11:38 +01:00
Brad Davidson 051b14b248 Fix netpol startup when flannel is disabled
Don't break out of the poll loop if we can't get the node, RBAC might not be ready yet.

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-02-26 17:40:44 -08:00
Derek Nola 9c0e5a5ff8 Rename AgentReady to ContainerRuntimeReady for better clarity
Signed-off-by: Derek Nola <derek.nola@suse.com>
2024-02-21 13:26:08 -08:00
Derek Nola 80baec697f Restore original order of agent startup functions
Signed-off-by: Derek Nola <derek.nola@suse.com>
2024-02-21 13:26:08 -08:00
Derek Nola 45860105bb
[Release-1.28] Test_UnitApplyContainerdQoSClassConfigFileIfPresent (#9440)
* [Testing]: Test_UnitApplyContainerdQoSClassConfigFileIfPresent (Created) (#8945)

Problem:
Function not tested.

Solution:
Unit test added.

Signed-off-by: Oliver Larsson <larsson.e.oliver@gmail.com>
---------

Signed-off-by: Oliver Larsson <larsson.e.oliver@gmail.com>
Signed-off-by: Derek Nola <derek.nola@suse.com>
Co-authored-by: Oliver Larsson <larsson.e.oliver@gmail.com>
2024-02-12 09:33:32 -08:00
Harrison Affel a922a0e340 allow executors to define containerd and docker behavior
Signed-off-by: Harrison Affel <harrisonaffel@gmail.com>
2024-02-09 16:05:58 -03:00