* Add missing default OS for split server test
Signed-off-by: Derek Nola <derek.nola@suse.com>
* Launch go routine and return for k3s secrets-encrypt reencrypt
Signed-off-by: Derek Nola <derek.nola@suse.com>
---------
Signed-off-by: Derek Nola <derek.nola@suse.com>
The loadbalancer should only fail over to the default server if all other servers have failed, and it should force fail-back to a preferred server as soon as one passes health checks.
The loadbalancer tests have been improved to ensure that this occurs.
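
A minimal sketch of the selection rule described above, assuming a hypothetical `server` type and `pickServer` helper (neither name comes from the k3s source). It only illustrates the ordering: the default server is used when every other server has failed, and a preferred server wins again as soon as it is healthy.

```go
package main

import "fmt"

// server is a hypothetical stand-in for a load-balancer backend.
type server struct {
	address   string
	healthy   bool
	isDefault bool
}

// pickServer returns the first healthy non-default server. The default
// server is only returned when no other server is healthy.
func pickServer(servers []server) (string, bool) {
	var fallback string
	hasFallback := false
	for _, s := range servers {
		if s.isDefault {
			fallback, hasFallback = s.address, true
			continue
		}
		if s.healthy {
			return s.address, true
		}
	}
	return fallback, hasFallback
}

func main() {
	servers := []server{
		{address: "default:6443", healthy: true, isDefault: true},
		{address: "10.0.0.1:6443", healthy: false},
		{address: "10.0.0.2:6443", healthy: false},
	}
	addr, _ := pickServer(servers)
	fmt.Println(addr) // every preferred server is down, so the default is used

	servers[2].healthy = true
	addr, _ = pickServer(servers)
	fmt.Println(addr) // a preferred server is healthy again, so it is chosen
}
```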
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
This fixes: 'error: no Auth Provider found for name "oidc"' when trying to run any subcommands in kubectl that require a valid server login.
Signed-off-by: Ludo Stellingwerff <ludo.stellingwerff@gmail.com>
External CLI actions cannot short-circuit on --help or --version, so we
cannot skip loading the config file if these flags are present when
running these wrapped commands. The behavior of just returning the
override flag name instead of the requested flag value was breaking
data-dir lookup when running wrapped commands.
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Add new flag that is passed through to the device_ownership_from_security_context parameter in the containerd CRI config. This is not possible to change without providing a complete custom containerd.toml template so we should add a flag for it.
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
* Use clientv3.NewCtxClient instead of New to avoid automatic retry of all RPCs
* Only timeout status requests; allow defrag and alarm clear requests to run to completion.
* Only clear alarms on the local cluster member, not ALL cluster members
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
* Fix vagrant/libvirt composite action for ubuntu-24.04
* Don't ignore changes to internal actions
* Fix unit tests for ubuntu 24.04, new lsof version
* Pin os version for unit and E2E workflows
Signed-off-by: Derek Nola <derek.nola@suse.com>
Don't delete s3 etcdsnapshotfiles if they are missing from s3 but less than a minute old. It's possible that the other node just finished uploading the snapshot and the object key has not yet become visible.
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Used https://github.com/coredns/corefile-migration to
migrate the corefile. There are no changes for the
default file from 1.10.1 to 1.11.3.
Notable plugin changes include the k8s_external with fallthrough option
and rewrite with cname_target option.
These changes are not part of the default config that ships
with k3s. Customers using these two plugins can start using the new options.
Metrics does not have any new features other than build tooling updates.
Requires https://github.com/rancher/image-mirror/pull/704
Signed-off-by: Harsimran Singh Maan <maan.harry@gmail.com>
Also silences warnings about bootstrap fields that are not intended to be handled by CA rotation
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
FindString would silently skip parsing dropins if the main config file
didn't exist. If a custom config file path was passed it would raise an
error, but if we were parsing the default config file and it didn't
exist it would just silently fail to load the dropins.
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
* Allow pprof to run on server with `--disable-agent`
* Allow supervisor metrics to run on server with `--disable-agent`
Signed-off-by: Derek Nola <derek.nola@suse.com>
Fixes an issue where running etcd-snapshot commands on a node that has a server address set in the config will manage snapshots on that server, instead of on the local node as intended.
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
This should give us more detail on how long dials take before failing, so that we can perhaps better tune the retry loop in the future.
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
I should have caught `[]string{cfg.NodeIP}[0]` and `[]string{envInfo.NodeIP.String()}[0]` in code review...
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
We shouldn't be replacing the configured server address on agents. Doing
so breaks the agent's ability to fall back to the fixed registration
endpoint when all servers are down, since we replaced it with the first
discovered apiserver address. The fixed registration endpoint will be
restored as default when the service is restarted, but this is not the
correct behavior. This should have only been done on etcd-only nodes
that start up using their local supervisor, but need to switch to a
control-plane node as soon as one is available.
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
* Move snapshot structs and functions into pkg/etcd/snapshot
* Move s3 client code and functions into pkg/etcd/s3
* Refactor pkg/etcd to track snapshot and s3 moves
* Add support for reading s3 client config from secret
* Add minio client cache, since S3 client configuration can now be
changed at runtime by modifying the secret, and we don't want to have to
create a new minio client every time we read config.
* Add tests for pkg/etcd/s3
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
* chore: Bump Local Path Provisioner version
Made with ❤️️ by updatecli
---------
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Fixes an issue where the semaphore wasn't permanently initialized
until a scheduled snapshot was taken, allowing multiple on-demand
snapshots to be taken until the first scheduled snapshot was triggered.
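
As an illustration of the fix, the sketch below creates the semaphore when the object is constructed, so on-demand and scheduled snapshots share it from the start. The `snapshotter` type is hypothetical, and rejecting a concurrent request (rather than waiting for the slot) is an assumption made to keep the example short.

```go
package main

import (
	"errors"
	"fmt"
)

var errInProgress = errors.New("snapshot already in progress")

// snapshotter is a hypothetical type; the buffered channel acts as a
// semaphore with a single slot.
type snapshotter struct {
	sem chan struct{}
}

// newSnapshotter initializes the semaphore up front instead of lazily.
func newSnapshotter() *snapshotter {
	return &snapshotter{sem: make(chan struct{}, 1)}
}

// snapshot runs take while holding the only slot, and fails fast if
// another snapshot is already running.
func (s *snapshotter) snapshot(take func() error) error {
	select {
	case s.sem <- struct{}{}:
		defer func() { <-s.sem }()
		return take()
	default:
		return errInProgress
	}
}

func main() {
	s := newSnapshotter()
	err := s.snapshot(func() error {
		// A second snapshot requested while the first is running is rejected.
		return s.snapshot(func() error { return nil })
	})
	fmt.Println(err)
}
```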
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
If proxy.SetAPIServerPort was called multiple times, all calls after the
first one would cause the apiserver address to be set to the default
server address, bypassing the local load-balancer. This was most likely
to occur on RKE2, where the supervisor may be up for a period of time
before it is ready to manage node password secrets, causing the agent
to retry.
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
* Add write-kubeconfig-group flag to server
* update kubectl unable to read config message for kubeconfig mode/group
Signed-off-by: Katherine Pata <me@kitty.sh>
If health checks are failing for all servers, make a second pass through the server list with health-checks ignored before returning failure
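
A rough sketch of the two-pass approach, using a hypothetical `backend` type and `dialAny` helper. In this simplified version the second pass runs whenever the first pass produced no successful dial, which covers the case where every server is failing health checks.

```go
package main

import (
	"errors"
	"fmt"
)

var errAllFailed = errors.New("all servers failed")

// backend is a hypothetical load-balancer entry.
type backend struct {
	address string
	healthy bool
}

// dialAny first tries only servers that pass health checks. If none of
// them can be dialed, it makes a second pass with health checks ignored
// before returning failure.
func dialAny(backends []backend, dial func(addr string) error) (string, error) {
	for _, ignoreHealth := range []bool{false, true} {
		for _, b := range backends {
			if !ignoreHealth && !b.healthy {
				continue
			}
			if err := dial(b.address); err == nil {
				return b.address, nil
			}
		}
	}
	return "", errAllFailed
}

func main() {
	backends := []backend{
		{address: "10.0.0.1:6443", healthy: false},
		{address: "10.0.0.2:6443", healthy: false},
	}
	dial := func(addr string) error {
		if addr == "10.0.0.2:6443" {
			return nil
		}
		return errors.New("connection refused")
	}
	addr, err := dialAny(backends, dial)
	fmt.Println(addr, err)
}
```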
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
It is conceivable that users might take more than 60 seconds to deploy their own cloud-provider. Instead of exiting, we should wait forever, but with more logging to indicate what's being waited on.
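
The "wait forever, but log" behavior might look something like the sketch below. `waitFor`, its parameters, and the log message are hypothetical; the loop has no timeout of its own and only stops early if the context is cancelled.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitFor polls ready until it returns true. It never gives up on its
// own; every unsuccessful poll is logged so the operator can see what
// is being waited on.
func waitFor(ctx context.Context, what string, interval time.Duration, ready func() bool, logf func(format string, args ...any)) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if ready() {
			return nil
		}
		logf("Waiting for %s to become available", what)
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// demo waits on a condition that succeeds on the third poll and returns
// how many log messages were emitted along the way.
func demo() (int, error) {
	polls, logged := 0, 0
	ready := func() bool { polls++; return polls >= 3 }
	logf := func(format string, args ...any) {
		logged++
		fmt.Printf(format+"\n", args...)
	}
	err := waitFor(context.Background(), "cloud-provider", time.Millisecond, ready, logf)
	return logged, err
}

func main() {
	logged, err := demo()
	fmt.Println(logged, err)
}
```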
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
* Refactor agent supervisor listener startup and authn/authz to use upstream
auth delegators to perform SubjectAccessReview for access to
metrics.
* Convert spegel and pprof handlers over to new structure.
* Promote bind-address to agent flag to allow setting supervisor bind
address for both agent and server.
* Promote enable-pprof to agent flag to allow profiling agents. Access
to the pprof endpoint now requires client cert auth, similar to the
spegel registry api endpoint.
* Add prometheus metrics handler.
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Start shared informer caches when the k3s-etcd controller wins leader election. Previously, these were only started when the main k3s apiserver controller won an election. If the leaders ended up on different nodes, some informers wouldn't be started.
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
* kubernetes 1.30.0-k3s1
Signed-off-by: galal-hussein <hussein.galal.ahmed.11@gmail.com>
* Update go version to v1.22.2
Signed-off-by: galal-hussein <hussein.galal.ahmed.11@gmail.com>
* update dynamiclistener and helm-controller
Signed-off-by: galal-hussein <hussein.galal.ahmed.11@gmail.com>
* update go in go.mod to 1.22.2
Signed-off-by: galal-hussein <hussein.galal.ahmed.11@gmail.com>
* update go in Dockerfiles
Signed-off-by: galal-hussein <hussein.galal.ahmed.11@gmail.com>
* update cri-dockerd
Signed-off-by: galal-hussein <hussein.galal.ahmed.11@gmail.com>
* Add proctitle package with linux and windows constraints
Signed-off-by: galal-hussein <hussein.galal.ahmed.11@gmail.com>
* go mod tidy
Signed-off-by: galal-hussein <hussein.galal.ahmed.11@gmail.com>
* Fixing setproctitle function
Signed-off-by: galal-hussein <hussein.galal.ahmed.11@gmail.com>
* update dynamiclistener to v0.6.0-rc1
Signed-off-by: galal-hussein <hussein.galal.ahmed.11@gmail.com>
---------
Signed-off-by: galal-hussein <hussein.galal.ahmed.11@gmail.com>
The default clientaccess request timeout is too short. Wait longer by default, and add the s3 timeout if s3 is enabled.
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
* Update traefik chart to bump image tag and fix quoting
* Fix image quoting in flat manifests
* Update local-path-provisioner config to stop using deprecated hostpath volume type
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Prefer the address of the etcd member being joined, and seed the full address list immediately on startup.
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
* Bump spegel to v0.0.20-k3s1
* Remove deprecated libp2p Pretty function
* Remove quic-go pin
The pinned version is now out of date; indirect dependencies are now newer and include the CVE fix.
Signed-off-by: Derek Nola <derek.nola@suse.com>
* Adds support for health-checking loadbalancer servers. If a
health-check fails when dialing, all existing connections to the
server will be closed.
* Wires up a remotedialer tunnel connectivity check as the health check
for supervisor/apiserver connections.
* Wires up a simple ping request to the supervisor port as the health
check for etcd connections.
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
CRI and containerd APIs disagree about the registry names - CRI supports
index.docker.io as an alias for docker.io, while containerd does not.
Use the actual stored RepoTag to determine what image to ask containerd for.
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Don't clobber the providerID field and instance-type/region/zone labels if provided by the kubelet. This allows the user to set these to the correct values when using the embedded CCM in a real cloud environment.
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Prevents joining nodes from being stuck with bad initial member list if there is a transient failure, or if they try to join themselves
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Fix the wasm shim detection and the containerd configuration generation.
Prior to this commit, the binary and the `RuntimeType` values were not
correct.
Signed-off-by: Flavio Castelli <fcastelli@suse.com>
* Set ServerNodeName in snapshot CLI setup
* Raise error if ServerNodeName ends up empty some other way
* Fix status controller to use etcd node name annotation instead of prefix checking
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
* Add both dual-stack addresses to the node hosts file
* Add hostname to hosts file as alias for node name to ensure consistent resolution
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
* Reorder copy order for caching
* Enable longer http timeout requests
Signed-off-by: Derek Nola <derek.nola@suse.com>
* Setup reencrypt controller to run on all apiserver nodes
* Fix reencryption for disabling secrets encryption, reenable drone tests
The nodes controller was reading from the configmaps cache, but doesn't add any handlers, so if no other controller added configmap handlers, the cache would remain empty.
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
* Fix issue with bare host or IP as endpoint
* Fix issue with localhost registries not defaulting to http.
* Move the registry template prep to a separate function,
and add tests of that function so that we can ensure we're
generating the correct content.
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
* Fixes issue where proxy support only honored server address via K3S_URL, not CLI or config.
* Fixes crash when agent proxy is enabled, but proxy env vars do not return a proxy URL for the server address (server URL is in NO_PROXY list).
* Adds tests
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Moving it into config.Agent so that we can use or modify it outside the context of containerd setup
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Layer leases never did what we wanted anyways, and this is the new approved interface for ensuring that images do not get GCd
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Forces other groups packaging k3s to intentionally choose to build k3s with an unvalidated golang version
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Render cri registry mirrors.x.endpoints and configs.x.tls into config_path; keep
using mirrors.x.rewrites and configs.x.auth, as those do not yet have an
equivalent in the new format.
The new config file format allows disabling containerd's fallback to the
default endpoint when using mirror endpoints; a new CLI flag is added to
control that behavior.
This also re-shares some code that was unnecessarily split into parallel
implementations for linux/windows versions. There is probably more work
to be done on this front but it's a good start.
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
If a full reconcile wins the race against sync of an individual snapshot resource, or someone intentionally deletes the configmap, the data map could be nil and cause a crash.
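
The crash comes from Go's rule that assigning into a nil map panics. A minimal sketch of the guard, using a hypothetical `configMap` type in place of the real Kubernetes object:

```go
package main

import "fmt"

// configMap is a hypothetical stand-in for a Kubernetes ConfigMap.
type configMap struct {
	Data map[string]string
}

// setEntry initializes Data before writing, because assigning into a
// nil map panics at runtime.
func setEntry(cm *configMap, key, value string) {
	if cm.Data == nil {
		cm.Data = map[string]string{}
	}
	cm.Data[key] = value
}

func main() {
	cm := &configMap{} // Data is nil, as after a delete or a lost race
	setEntry(cm, "snapshot-1", "{}")
	fmt.Println(len(cm.Data))
}
```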
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
If the feature-gate is enabled, use status.hostIPs for dual-stack externalTrafficPolicy=Local support
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Remove KubeletCredentialProviders and JobTrackingWithFinalizers feature-gates, both of which are GA and cannot be disabled.
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
While some implementations may support it, it appears that most don't,
and some may in fact return an error if it is requested.
We already stat the object to get the metadata anyway, so this was
unnecessary if harmless on most implementations.
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
* Disable helm CRD installation for disable-helm-controller
The NewContext package requires config as input which would
require all third-party callers to update when the new go module
is published.
This change only affects the behaviour of installation of helm
CRDs. Existing helm crds installed in a cluster would not be removed
when disable-helm-controller flag is set on the server.
Addresses #8701
* address review comments
* remove redundant check
Signed-off-by: Harsimran Singh Maan <maan.harry@gmail.com>
* Tweaked order of ingress IPs in ServiceLB
Previously, ingress IPs were only string-sorted when returned.
They are now sorted by IP family, and string-sorted within each family, as part of
the filterByIPFamily method.
* Update pkg/cloudprovider/servicelb.go
* Formatting
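
The ordering can be sketched as below. `sortIngressIPs` is a hypothetical helper, not the actual filterByIPFamily code, and putting IPv4 ahead of IPv6 is an assumption made for the example; the commit only says addresses are grouped by family and string-sorted within each family.

```go
package main

import (
	"fmt"
	"net/netip"
	"sort"
)

// familyRank groups addresses: IPv4 first, then IPv6, then anything
// that does not parse as an IP address.
func familyRank(s string) int {
	addr, err := netip.ParseAddr(s)
	switch {
	case err != nil:
		return 2
	case addr.Is4():
		return 0
	default:
		return 1
	}
}

// sortIngressIPs returns a copy of ips sorted by family and then
// string-sorted within each family.
func sortIngressIPs(ips []string) []string {
	out := append([]string(nil), ips...)
	sort.SliceStable(out, func(i, j int) bool {
		ri, rj := familyRank(out[i]), familyRank(out[j])
		if ri != rj {
			return ri < rj
		}
		return out[i] < out[j]
	})
	return out
}

func main() {
	fmt.Println(sortIngressIPs([]string{"fd00::2", "10.0.0.2", "fd00::1", "10.0.0.1"}))
}
```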
Signed-off-by: Jason Costello <jason@hazy.com>
Co-authored-by: Brad Davidson <brad@oatmail.org>
Omit snapshot list configmap entries for snapshots without extra metadata; reduce log level of warnings about missing s3 metadata files.
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Problem:
Configuring qos-class features in containerd requires a custom containerd configuration template.
Solution:
Look for configuration files in default locations and configure containerd to use them if they exist.
Signed-off-by: Oliver Larsson <larsson.e.oliver@gmail.com>
Create a generic helper function that finds extra containerd runtimes.
The code was originally inside of the nvidia container discovery file.
Signed-off-by: Flavio Castelli <fcastelli@suse.com>
Discover the containerd shims based on runwasi that are already
available on the node.
The runtimes could have been installed either by a package manager or by
the kwasm operator.
Signed-off-by: Flavio Castelli <fcastelli@suse.com>
The containerd configuration on a Linux system now handles the nvidia
and the WebAssembly runtimes.
Signed-off-by: Flavio Castelli <fcastelli@suse.com>
---------
Signed-off-by: Flavio Castelli <fcastelli@suse.com>
These fields are only necessary when saving snapshots to S3, and will block restoration if attempted
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Silences error message from lasso - this is a normal startup condition
when no snapshots exist so we shouldn't log nasty looking errors.
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Removing this in 002e6c43ee regressed
control-plane-only nodes, as we rely on the etcd client to update its
endpoint list internally so that we can use it to sync the load-balancer
address list.
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Enable the feature-gate for both kubelet and cloud-controller-manager. Enabling it on only one side breaks RKE2, where feature-gates are not shared due to running in different processes.
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
* initial windows port.
Signed-off-by: Sean Yen <seanyen@microsoft.com>
Signed-off-by: Derek Nola <derek.nola@suse.com>
Co-authored-by: Derek Nola <derek.nola@suse.com>
Co-authored-by: Wei Ran <weiran@microsoft.com>
Write the extra metadata both locally and to S3. These files are placed such that they will not be used by older versions of K3s that do not make use of them.
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
* Consolidate NewCertCommands
* Add support for user defined new token
* Add E2E testlets
Signed-off-by: Derek Nola <derek.nola@suse.com>
* Ensure agent token also changes
Signed-off-by: Derek Nola <derek.nola@suse.com>
* Add --image-service-endpoint flag
Problem:
External container runtime can be set but image service endpoint is unchanged
and also is not exposed as a flag. This is useful for using containerd
snapshotters outside of the ones that have built-in support like
stargz-snapshotter.
Solution:
Add a flag --image-service-endpoint and also default image service endpoint to
container runtime endpoint if set.
Signed-off-by: Edgar Lee <edgarhinshunlee@gmail.com>
* Update to v1.28.2
Signed-off-by: Johnatas <johnatasr@hotmail.com>
* Bump containerd and stargz versions
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
* Print message on upgrade fail
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
* Send Bad Gateway instead of Service Unavailable when tunnel dial fails
Works around new handling for Service Unavailable by apiserver aggregation added in kubernetes/kubernetes#119870
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
* Add 60 seconds to server upgrade wait to account for delays in apiserver readiness
Also change cleanup helper to ensure upgrade test doesn't pollute the
images for the rest of the tests.
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
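
For the Bad Gateway change in the list above, a minimal sketch of a handler that maps a failed tunnel dial to 502 instead of 503. `tunnelHandler`, `statusFor`, the error text, and the example host are hypothetical; only the choice of status code comes from the commit.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

var errNoTunnel = errors.New("no tunnel session available")

// tunnelHandler reports dial failures as 502 Bad Gateway rather than
// 503 Service Unavailable.
func tunnelHandler(dial func(host string) error) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if err := dial(r.Host); err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
}

// statusFor sends one request through the handler and returns the
// resulting status code.
func statusFor(dial func(host string) error) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "http://node.example:10250/", nil)
	tunnelHandler(dial).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor(func(string) error { return errNoTunnel }))
	fmt.Println(statusFor(func(string) error { return nil }))
}
```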
---------
Signed-off-by: Johnatas <johnatasr@hotmail.com>
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Co-authored-by: Brad Davidson <brad.davidson@rancher.com>
`k3s etcd-snapshot save --etcd-s3 ...` creates a local snapshot and uploads it to s3, while `k3s etcd-snapshot delete --etcd-s3 ...` deleted the snapshot only from the s3 bucket. This commit changes the behavior of delete so that it removes the snapshot both locally and from s3.
Signed-off-by: Ian Cardoso <osodracnai@gmail.com>
Wire up a node watch to collect addresses of server nodes, to prevent adding unauthorized SANs to the dynamiclistener cert.
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
* Consolidate CopyFile function
Signed-off-by: Derek Nola <derek.nola@suse.com>
* Copy to File, not destination folder
Signed-off-by: Derek Nola <derek.nola@suse.com>
---------
Signed-off-by: Derek Nola <derek.nola@suse.com>
Only configure enable-aggregator-routing and egress-selector-config-file
if required by egress-selector-mode.
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Problem:
When support for etcd was added in 3957142, generation of certificates and keys for etcd was not gated behind use of managed etcd.
Keys are generated and distributed across servers even if managed etcd is not enabled.
Solution:
Allow generation of certificates and keys only if managed etcd is enabled. Check config.DisableETCD flag.
Signed-off-by: Bartossh <lenartconsulting@gmail.com>
Need to add a cli flag for this. Also, should probably have config file loading support for the certificate commands.
Signed-off-by: leilei.zhai <leilei.zhai@qingteng.cn>
* Shortcircuit search with help and version flag
Signed-off-by: Derek Nola <derek.nola@suse.com>
* Keep functions separate
Signed-off-by: Derek Nola <derek.nola@suse.com>
---------
Signed-off-by: Derek Nola <derek.nola@suse.com>
Allows nodes to join the cluster during a webhook outage. This also
enhances auditability by creating Kubernetes events for the deferred
verification.
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
* Move coverage writer into agent and server
* Add coverage report to E2E PR tests
* Add codecov upload to drone
Signed-off-by: Derek Nola <derek.nola@suse.com>
There is no way to configure the lb image because it is a const value.
It would be better to make it a variable so that the value can be
overridden when compiling k3s/rke2, like the `helm-controller` job image.
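
A small sketch of why a variable helps: the Go linker's `-X` flag can set package-level string variables at build time, but not constants. The variable name, package path, and image values below are hypothetical.

```go
package main

import "fmt"

// DefaultLBImage is a hypothetical name and value. Because it is a
// string variable rather than a constant, it can be replaced at build
// time, for example:
//
//	go build -ldflags "-X main.DefaultLBImage=registry.example/lb:custom"
var DefaultLBImage = "example/lb:dev"

// lbImage returns the image that would be used for the lb pods.
func lbImage() string {
	return DefaultLBImage
}

func main() {
	fmt.Println(lbImage())
}
```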
Signed-off-by: Yuxing Deng <jxfa0043379@hotmail.com>