This commit introduces the DRAResourceClaimGranularStatusAuthorization
feature gate (Beta in 1.36) to enforce fine-grained authorization checks
on ResourceClaim status updates.
Previously, 'update' permission on 'resourceclaims/status' allowed modifying
the entire status. To enforce the principle of least privilege for DRA
drivers and the scheduler, this change introduces synthetic subresources and
verb prefixes:
- 'resourceclaims/binding': Required to update 'status.allocation' and
'status.reservedFor'.
- 'resourceclaims/driver': Required to update 'status.devices'. Evaluated
on a per-driver basis using 'associated-node:<verb>' (for node-local
ServiceAccounts) or 'arbitrary-node:<verb>' (for cluster-wide controllers).
Add unit tests for handwritten and declarative validation, controller
logic, metrics, table printer output, controller-manager registration,
etcd storage round-trip, and an integration test for the full RPSR
lifecycle. Also add an e2e test exercising the DRA test driver with
RPSR and the example manifest.
The fields become beta, enabled by default. DeviceTaintRule gets
added to the v1beta2 API, but support for it must remain off by default
because that API group is also off by default.
The v1beta1 API is left unchanged. No-one should be using it
anymore (deprecated in 1.33, could be removed now if it wasn't for
reading old objects and version emulation).
To achieve consistent validation, declarative validation must be enabled also
for v1alpha3 (was already enabled for other versions). Otherwise,
TestVersionedValidationByFuzzing fails:
--- FAIL: TestVersionedValidationByFuzzing (0.09s)
--- FAIL: TestVersionedValidationByFuzzing/resource.k8s.io/v1beta2,_Kind=DeviceTaintRule (0.00s)
validation_test.go:109: different error count (0 vs. 1)
resource.k8s.io/v1alpha3: <no errors>
resource.k8s.io/v1beta2: "spec.taint.effect: Unsupported value: \"幤HxÒQP¹¬永唂ȳ垞ş]嘨鶊\": supported values: \"NoExecute\", \"NoSchedule\", \"None\""
...
Update outdated image versions across test manifests and add tracking
to build/dependencies.yaml for version drift detection via zeitgeist:
- agnhost: 2.32/2.53/2.54/2.57 → 2.61 (latest)
- etcd: 3.2.24 → 3.6.7-0
- kitten/nautilus BASEIMAGE: agnhost 2.57 → 2.61
and added etcd statefulset reference to existing etcd entry.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
As usual, consumers of an allocated claim react to the information stored in
the status. In this case, the scheduler did not copy the tolerations into the
status and as a result a pod with a toleration for NoExecute got scheduled and
then immediately evicted.
Some additional logging gets added to make the handling easier to track in the
eviction controller. Example YAMLs allow reproducing the use case manually.
As before when adding v1beta2, DRA drivers built using the
k8s.io/dynamic-resource-allocation helper packages remain compatible with all
Kubernetes release >= 1.32. The helper code picks whatever API version is
enabled from v1beta1/v1beta2/v1.
However, the control plane now depends on v1, so a cluster configuration where
only v1beta1 or v1beta2 are enabled without the v1 won't work.
The helper code is useful for a separate Ginkgo suite for upgrade/downgrade
testing. We don't want to import test/e2e/dra there because that would also
define additional tests.
YAML files were patched with:
sed -i -e 's;registry.k8s.io/e2e-test-images/agnhost:2...;registry.k8s.io/e2e-test-images/agnhost:2.54;' $(git grep -l agnhost:2 test/e2e/testing-manifests/ test/fixtures/)
The test/images/kitten and test/images/nautilus base images are still on an
older agnhost because updating those is better left to the owners.
`terminationGracePeriodSeconds: 0` was a mistake, it bypasses the normal
pod shutdown in the kubelet.
The right way to shut down a pod quickly is to have it react to SIGTERM.
The busybox implementation of "sleep" doesn't. `agnhost pause` does,
so let's use that instead.
For E2E tests, the InfiniteSleepCommand was already change about a year ago to
react to SIGTERM, so the `terminationGracePeriodSeconds: 1` workaround is no
longer needed.
Some more out-dated reference to resource class. Keeping the pod running is
better for demonstrating the lifecycle of claims because it is actually
possible to see a claim in the allocated state.
This adds the "DeviceTaint" top-level type to v1alpha3 and related fields to
ResourceSlice and ResourceClaim. It's complete enough bring up an API server
and generate files.
This fixes the message (node name and "cluster-scoped" were switched) and
simplifies the VAP:
- a single matchCondition short circuits completely unless they're a user
we care about
- variables to extract the userNodeName and objectNodeName once
(using optionals to gracefully turn missing claims and fields into empty strings)
- leaves very tiny concise validations
Co-authored-by: Jordan Liggitt <liggitt@google.com>
The advantages of using a validation admission policy (VAP) are that no changes
are needed in Kubernetes and that admins have full flexibility if and how they
want to control which users are allowed to use "admin access" in their
requests.
The downside is that without admins taking actions, the feature is enabled
out-of-the-box in a cluster. Documentation for DRA will have to make it very
clear that something needs to be done in multi-tenant clusters.
The test/e2e/testing-manifests/dra/admin-access-policy.yaml shows how to do
this. The corresponding E2E tests ensures that it actually works as intended.
For some reason, adding the namespace to the message expression leads to a
type check errors, so it's currently commented out.
As agreed in https://github.com/kubernetes/enhancements/pull/4709, immediate
allocation is one of those features which can be removed because it makes no
sense for structured parameters and the justification for classic DRA is weak.
This is in preparation for revamping the resource.k8s.io completely. Because
there will be no support for transitioning from v1alpha2 to v1alpha3, the
roundtrip test data for that API in 1.29 and 1.30 gets removed.
Repeating the version in the import name of the API packages is not really
required. It was done for a while to support simpler grepping for usage of
alpha APIs, but there are better ways for that now. So during this transition,
"resourceapi" gets used instead of "resourcev1alpha3" and the version gets
dropped from informer and lister imports. The advantage is that the next bump
to v1beta1 will affect fewer source code lines.
Only source code where the version really matters (like API registration)
retains the versioned import.
In reality, the kubelet plugin of a DRA driver is meant to be deployed as a
daemonset with a service account that limits its
permissions. https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#additional-metadata-in-pod-bound-tokens
ensures that the node name is bound to the pod, which then can be used
in a validating admission policy (VAP) to ensure that the operations are
limited to the node.
In E2E testing, we emulate that via impersonation. This ensures that the plugin
does not accidentally depend on additional permissions.
This makes the API nicer:
resourceClaims:
- name: with-template
resourceClaimTemplateName: test-inline-claim-template
- name: with-claim
resourceClaimName: test-shared-claim
Previously, this was:
resourceClaims:
- name: with-template
source:
resourceClaimTemplateName: test-inline-claim-template
- name: with-claim
source:
resourceClaimName: test-shared-claim
A more long-term benefit is that other, future alternatives
might not make sense under the "source" umbrella.
This is a breaking change. It's justified because DRA is still
alpha and will have several other API breaks in 1.31.
In the Dynamic Resource allocation example specs, the claim
parameter name specified was inconsistent.
This commit fixes that with a better/more consistent name,
which is used to define the configmap and referenced in
the `ResourceClaimTemplate` spec.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
The driver can be used manually against a cluster started with
local-up-cluster.sh and is also used for E2E testing. Because the tests proxy
connections from the nodes into the e2e.test binary and create/delete files via
the equivalent of "kubectl exec dd/rm", they can be run against arbitrary
clusters. Each test gets its own driver instance and resource class, therefore
they can run in parallel.