IDsPerPod is the mapping length of subids for UserNS.
The length must be multiple of 65536.
Default: 65536
Implements kubernetes/enhancements PR 5020 (addendum to KEP-127)
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
The tests and comments have also been updated because while
VolumeCapacityPriority preferred a node with the least allocatable,
StorageCapacityScoring preferred a node with the maximum allocatable.
- Add new version fields to version.Info struct:
* EmulationMajor and EmulationMinor to track emulated version
* MinCompatibilityMajor and MinCompatibilityMinor for compatibility tracking
- Update related code to populate and use these new fields
- Improve version information documentation and OpenAPI generation
- Modify version routes and documentation to reflect new version information structure
Graduate the feature to beta, by:
- Allowing `subPath`/`subPathExpr` for image volumes
- Modifying the CRI to pass down the (resolved) sub path
- Adding metrics which are outlined in the KEP
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
- Use environment variables to pass string arguments in the node log
query PS command
- Split getLoggingCmd into getLoggingCmdEnv and getLoggingCmdArgs
for better modularization
The original limit of 32 seemed sufficient for a single GPU on a node. But for
shared non-local resources it is too low. For example, a ResourceClaim might be
used to allocate an interconnect channel that connects all pods of a workload
running on several different nodes, in which case the number of pods can be
considerably larger.
256 is high enough for currently planned systems. If we need something even
higher in the future, an alternative approach might be needed to avoid
scalability problems.
Normally, increasing such a limit would have to be done incrementally over two
releases. In this case we decided on
Slack (https://kubernetes.slack.com/archives/CJUQN3E4T/p1734593174791519) to
make an exception and apply this change to current master for 1.33 and backport
it to the next 1.32.x patch release for production usage.
This breaks downgrades to a 1.32 release without this change if there are
ResourceClaims with a number of consumers > 32 in ReservedFor. In practice,
this breakage is very unlikely because there are no workloads yet which need so
many consumers and such downgrades to a previous patch release are also
unlikely. Downgrades to 1.31 already weren't supported when using DRA v1beta1.
* Add feature gate, API, and conflict validation tests for enablecrashloopbackoffmax
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
* Handle when current base is longer than node max
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
* Update pkg/features/kube_features.go
Co-authored-by: Tsubasa Nagasawa <toversus2357@gmail.com>
* Fix indentation
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
* Follow convention for success test
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
* Normalize casing, and change field to Duration
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
* Fix json name and some other casing errors
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
* Another one I missed before
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
* Don't clobber global max function
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
* Change to flat value in defaults.go
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
* Streamline validation and defaults
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
* Fix typecheck
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
* Lint
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
* Tighten up validation for subsecond values
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
* Rename field from MaxBackOffPeriod to MaxContainerRestartPeriod
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
* A few missed references to renames
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
* Only compare flags in flags test
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
* Don't mess with SetDefault signature
Nobody messes with SetDefault signature
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
* Fix stale signature change, and update test data
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
* Inspect current feature gates at defaulting time
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
* Don't use the global feature gate for temp usage
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
* Expose default error, and some comments
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
* Hint fuzzer for less arbitrary values to FeatureGates
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
---------
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Co-authored-by: Tsubasa Nagasawa <toversus2357@gmail.com>
This had been left out unintentionally earlier. Because theoretically there
might now be existing objects with parameters that are larger than whatever
limit gets enforced now, the limit only gets checked when parameters get
created or modified.
This is similar to the validation of CEL expressions and for consistency, the
same 10 Ki limit as for those is chosen.
Because the limit is not enforced for stored parameters, it can be increased in
the future, with the caveat that users who need larger parameters then depend
on the newer Kubernetes release with a higher limit. Lowering the limit is
harder because creating deployments that worked in older Kubernetes will not
work anymore with newer Kubernetes.
This enables a future extension where capacity of a single device gets consumed
by different claims. The semantic without any additional fields is the same as
before: a capacity cannot be split up and is only an attribute of a device.
Because its semantically the same as before, two-way conversion to v1alpha3 is
possible.
Using the "normal" logic for a feature gated field simplifies the
implementation of the feature gate.
There is one (entirely theoretic!) problem with updating from 1.31: if a claim
was allocated in 1.31 with admin access, the status field was not set because
it didn't exist yet. If a driver now follows the current definition of "unset =
off", then it will not grant admin access even though it should. This is
theoretic because drivers are starting to support admin access with 1.32, so
there shouldn't be any claim where this problem could occur.
The new DRAAdminAccess feature gate has the following effects:
- If disabled in the apiserver, the spec.devices.requests[*].adminAccess
field gets cleared. Same in the status. In both cases the scenario
that it was already set and a claim or claim template get updated
is special: in those cases, the field is not cleared.
Also, allocating a claim with admin access is allowed regardless of the
feature gate and the field is not cleared. In practice, the scheduler
will not do that.
- If disabled in the resource claim controller, creating ResourceClaims
with the field set gets rejected. This prevents running workloads
which depend on admin access.
- If disabled in the scheduler, claims with admin access don't get
allocated. The effect is the same.
The alternative would have been to ignore the fields in claim controller and
scheduler. This is bad because a monitoring workload then runs, blocking
resources that probably were meant for production workloads.
Drivers need to know that because admin access may also grant additional
permissions. The allocator needs to ignore such results when determining which
devices are considered as allocated.
In both cases it is conceptually cleaner to not rely on the content of the
ClaimSpec.
The main purpose is to protect against denial-of-service attacks. Scheduling
time depends a lot on unpredictable factors and expected scheduling time also
varies, so no attempt is made to limit the overall time spent on evaluating CEL
expressions per claim.
This removes the DRAControlPlaneController feature gate, the fields controlled
by it (claim.spec.controller, claim.status.deallocationRequested,
claim.status.allocation.controller, class.spec.suitableNodes), the
PodSchedulingContext type, and all code related to the feature.
The feature gets removed because there is no path towards beta and GA and DRA
with "structured parameters" should be able to replace it.
Fix typo and grammar in comments that get reflected through to the
generated documentation, regarding VolumeAttachments' use of
PersistentVolumes and PersistentVolumeClaims.
Signed-off-by: Eric Blake <eblake@redhat.com>
This is a complete revamp of the original API. Some of the key
differences:
- refocused on structured parameters and allocating devices
- support for constraints across devices
- support for allocating "all" or a fixed amount
of similar devices in a single request
- no class for ResourceClaims, instead individual
device requests are associated with a mandatory
DeviceClass
For the sake of simplicity, optional basic types (ints, strings) where the null
value is the default are represented as values in the API types. This makes Go
code simpler because it doesn't have to check for nil (consumers) and values
can be set directly (producers). The effect is that in protobuf, these fields
always get encoded because `opt` only has an effect for pointers.
The roundtrip test data for v1.29.0 and v1.30.0 changes because of the new
"request" field. This is considered acceptable because the entire `claims`
field in the pod spec is still alpha.
The implementation is complete enough to bring up the apiserver.
Adapting other components follows.
As agreed in https://github.com/kubernetes/enhancements/pull/4709, immediate
allocation is one of those features which can be removed because it makes no
sense for structured parameters and the justification for classic DRA is weak.
This is in preparation for revamping the resource.k8s.io completely. Because
there will be no support for transitioning from v1alpha2 to v1alpha3, the
roundtrip test data for that API in 1.29 and 1.30 gets removed.
Repeating the version in the import name of the API packages is not really
required. It was done for a while to support simpler grepping for usage of
alpha APIs, but there are better ways for that now. So during this transition,
"resourceapi" gets used instead of "resourcev1alpha3" and the version gets
dropped from informer and lister imports. The advantage is that the next bump
to v1beta1 will affect fewer source code lines.
Only source code where the version really matters (like API registration)
retains the versioned import.