This was previously caught during Filter by the allocator check. Doing it
sooner avoids wasting resources on a pod which ultimately cannot get scheduled.
While at it, be a bit clearer about which feature is disabled; the user
might not know that.
All logic related to obtaining DRA objects and tracking modifications
to ResourceClaims in-memory is extracted to DefaultDRAManager, which
implements framework.SharedDRAManager.
This is intended to be a no-op in terms of the DRA plugin behavior.
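Roughly sketched, the split looks like this (method names are from memory and
may differ slightly):

    // SharedDRAManager bundles read access to the DRA API objects with
    // in-memory tracking of ResourceClaim modifications that have not been
    // written back to the apiserver yet.
    type SharedDRAManager interface {
        ResourceClaims() ResourceClaimTracker
        ResourceSlices() ResourceSliceLister
        DeviceClasses() DeviceClassLister
    }

DefaultDRAManager implements this on top of the existing informers and
in-memory claim tracking, so the DRA plugin no longer owns that logic itself.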
A better place is the cel package because a) the name can become shorter
and b) it is tightly coupled with the compiler there.
Moving the compilation into the cache simplifies the callers.
"Allocated devices" are the ones which can be observed from the informer. "All
allocated devices" also includes those which are in flight and haven't been
written back to the apiserver.
The logic for skipping "admin access" was repeated in three different places. A
single foreachAllocatedDevices with a callback puts it into one function.
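A rough sketch of that helper (field and type names approximate the real code):

    // foreachAllocatedDevices invokes the callback for each device in the
    // claim's allocation result, skipping results with admin access because
    // those do not count as allocated.
    func foreachAllocatedDevices(claim *resourceapi.ResourceClaim, cb func(id DeviceID)) {
        if claim.Status.Allocation == nil {
            return
        }
        for _, result := range claim.Status.Allocation.Devices.Results {
            if result.AdminAccess != nil && *result.AdminAccess {
                continue
            }
            cb(MakeDeviceID(result.Driver, result.Pool, result.Device))
        }
    }

The three callers then only differ in their callback.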
DeviceClasses and different requests are very likely to contain the same
expression string. We don't need to compile that over and over again.
To avoid hanging onto that cache longer than necessary, it's currently tied to
each PreFilter/Filter combination. It might make sense to move this up into the
scheduler plugin and thus reuse compiled expressions for different pods.
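A simplified sketch of such a cache (not the actual cel package API):

    // compiled stands in for whatever the compiler returns (program,
    // variable usage, error).
    type compiled struct{}

    type compileCache struct {
        compile func(expression string) compiled

        mutex   sync.Mutex
        results map[string]compiled
    }

    // GetOrCompile compiles each distinct expression string only once.
    func (c *compileCache) GetOrCompile(expression string) compiled {
        c.mutex.Lock()
        defer c.mutex.Unlock()
        if result, ok := c.results[expression]; ok {
            return result
        }
        result := c.compile(expression)
        if c.results == nil {
            c.results = make(map[string]compiled)
        }
        c.results[expression] = result
        return result
    }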
goos: linux
goarch: amd64
pkg: k8s.io/kubernetes/test/integration/scheduler_perf
cpu: Intel(R) Core(TM) i9-7980XE CPU @ 2.60GHz
│ before │ after │
│ SchedulingThroughput/Average │ SchedulingThroughput/Average vs base │
PerfScheduling/SchedulingWithResourceClaimTemplateStructured/5000pods_500nodes-36 33.95 ± 4% 36.65 ± 2% +7.95% (p=0.002 n=6)
PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/empty_100nodes-36 105.8 ± 2% 106.7 ± 3% ~ (p=0.177 n=6)
PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/empty_500nodes-36 100.7 ± 1% 119.7 ± 3% +18.82% (p=0.002 n=6)
PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/half_100nodes-36 90.78 ± 1% 121.10 ± 4% +33.40% (p=0.002 n=6)
PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/half_500nodes-36 50.51 ± 7% 63.72 ± 3% +26.17% (p=0.002 n=6)
PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/full_100nodes-36 103.7 ± 5% 110.2 ± 2% +6.32% (p=0.002 n=6)
PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/full_500nodes-36 28.50 ± 2% 28.16 ± 5% ~ (p=0.102 n=6)
geomean 64.99 73.15 +12.56%
Using unique strings instead of normal strings speeds up allocation with
structured parameters because maps that use those strings as keys no longer need
to build hashes of the string content. However, care must be taken to call
unique.Make as little as possible because it is costly.
Pre-allocating the map of allocated devices reduces the need to grow the map
when adding devices.
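Both points sketched together (DeviceID only mirrors the idea, exact fields may
differ; unique.Make and unique.Handle come from Go's unique package):

    // DeviceID identifies one device. With unique.Handle[string] as map key
    // material, map operations hash and compare small handles instead of the
    // full string content.
    type DeviceID struct {
        Driver, Pool, Device unique.Handle[string]
    }

    // MakeDeviceID interns the strings. unique.Make is comparatively costly,
    // so it should be called once per string, not in every loop iteration.
    func MakeDeviceID(driver, pool, device string) DeviceID {
        return DeviceID{
            Driver: unique.Make(driver),
            Pool:   unique.Make(pool),
            Device: unique.Make(device),
        }
    }

    // Pre-allocating with the expected number of devices avoids growing the
    // map repeatedly while adding entries.
    func newAllocatedDevices(expected int) map[DeviceID]bool {
        return make(map[DeviceID]bool, expected)
    }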
goos: linux
goarch: amd64
pkg: k8s.io/kubernetes/test/integration/scheduler_perf
cpu: Intel(R) Core(TM) i9-7980XE CPU @ 2.60GHz
│ before │ after │
│ SchedulingThroughput/Average │ SchedulingThroughput/Average vs base │
PerfScheduling/SchedulingWithResourceClaimTemplateStructured/5000pods_500nodes-36 18.06 ± 2% 33.30 ± 2% +84.31% (p=0.002 n=6)
PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/empty_100nodes-36 104.7 ± 2% 105.3 ± 2% ~ (p=0.818 n=6)
PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/empty_500nodes-36 96.62 ± 1% 100.75 ± 1% +4.28% (p=0.002 n=6)
PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/half_100nodes-36 83.00 ± 2% 90.96 ± 2% +9.59% (p=0.002 n=6)
PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/half_500nodes-36 32.45 ± 7% 49.84 ± 4% +53.60% (p=0.002 n=6)
PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/full_100nodes-36 95.22 ± 7% 103.80 ± 1% +9.00% (p=0.002 n=6)
PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/full_500nodes-36 9.111 ± 10% 27.215 ± 7% +198.69% (p=0.002 n=6)
geomean 45.86 64.26 +40.12%
The Allocate call used to call back into the claim lister for each node. This
was significant work which showed up at the top of the CPU profile. It's
okay to list only once during PreFilter because the Filter call does not change
the claim status between Allocate calls.
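Sketched very roughly (names are illustrative, the real code goes through the
plugin's cycle state):

    // perPodState is set up once per scheduling cycle.
    type perPodState struct {
        // claims is the snapshot taken in PreFilter. Filter does not modify
        // claim statuses, so the snapshot stays valid for all Allocate calls
        // of the cycle and no per-node lister call is needed.
        claims []*resourceapi.ResourceClaim
    }

    func preFilter(list func() ([]*resourceapi.ResourceClaim, error), state *perPodState) error {
        claims, err := list()
        if err != nil {
            return err
        }
        state.claims = claims
        return nil
    }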
goos: linux
goarch: amd64
pkg: k8s.io/kubernetes/test/integration/scheduler_perf
cpu: Intel(R) Core(TM) i9-7980XE CPU @ 2.60GHz
│ before │ after │
│ SchedulingThroughput/Average │ SchedulingThroughput/Average vs base │
PerfScheduling/SchedulingWithResourceClaimTemplateStructured/5000pods_500nodes-36 15.04 ± 0% 18.06 ± 2% +20.07% (p=0.002 n=6)
PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/empty_100nodes-36 105.5 ± 1% 104.7 ± 2% ~ (p=0.485 n=6)
PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/empty_500nodes-36 95.83 ± 1% 96.62 ± 1% ~ (p=0.063 n=6)
PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/half_100nodes-36 79.67 ± 3% 83.00 ± 2% +4.18% (p=0.002 n=6)
PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/half_500nodes-36 27.11 ± 5% 32.45 ± 7% +19.68% (p=0.002 n=6)
PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/full_100nodes-36 84.00 ± 3% 95.22 ± 7% +13.36% (p=0.002 n=6)
PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/full_500nodes-36 7.110 ± 6% 9.111 ± 10% +28.15% (p=0.002 n=6)
geomean 41.05 45.86 +11.73%
Using the "normal" logic for a feature gated field simplifies the
implementation of the feature gate.
There is one (entirely theoretical!) problem with updating from 1.31: if a claim
was allocated in 1.31 with admin access, the status field was not set because
it didn't exist yet. If a driver now follows the current definition of "unset =
off", then it will not grant admin access even though it should. This is
theoretical because drivers are starting to support admin access with 1.32, so
there shouldn't be any claim where this problem could occur.
The new DRAAdminAccess feature gate has the following effects:
- If disabled in the apiserver, the spec.devices.requests[*].adminAccess
field gets cleared, and the same happens in the status. In both cases there
is one special scenario: if the field was already set and a claim or claim
template gets updated, the field is not cleared.
Also, allocating a claim with admin access is allowed regardless of the
feature gate and the field is not cleared. In practice, the scheduler
will not do that.
- If disabled in the resource claim controller, creating ResourceClaims
with the field set gets rejected. This prevents running workloads
which depend on admin access.
- If disabled in the scheduler, claims with admin access don't get
allocated. The effect is the same.
The alternative would have been to ignore the fields in claim controller and
scheduler. This is bad because a monitoring workload then runs, blocking
resources that probably were meant for production workloads.
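On the apiserver side this follows the usual pattern for feature-gated fields,
roughly (helper names are illustrative, not the actual strategy code):

    // dropDisabledAdminAccessFields clears adminAccess on create when the
    // feature gate is off, but keeps it on update if the old object already
    // used it.
    func dropDisabledAdminAccessFields(newClaim, oldClaim *resourceapi.ResourceClaim) {
        if utilfeature.DefaultFeatureGate.Enabled(features.DRAAdminAccess) ||
            adminAccessInUse(oldClaim) {
            return
        }
        for i := range newClaim.Spec.Devices.Requests {
            newClaim.Spec.Devices.Requests[i].AdminAccess = nil
        }
    }

    func adminAccessInUse(claim *resourceapi.ResourceClaim) bool {
        if claim == nil {
            return false
        }
        for _, request := range claim.Spec.Devices.Requests {
            if request.AdminAccess != nil {
                return true
            }
        }
        return false
    }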
Drivers need to know that because admin access may also grant additional
permissions. The allocator needs to ignore such results when determining which
devices are considered as allocated.
In both cases it is conceptually cleaner to not rely on the content of the
ClaimSpec.
This removes the DRAControlPlaneController feature gate, the fields controlled
by it (claim.spec.controller, claim.status.deallocationRequested,
claim.status.allocation.controller, class.spec.suitableNodes), the
PodSchedulingContext type, and all code related to the feature.
The feature gets removed because there is no path towards beta and GA and DRA
with "structured parameters" should be able to replace it.
Having a dedicated ActionType which only gets used when the scheduler itself
already detects some change in the list of generated ResourceClaims of a pod
avoids calling the DRA plugin for unrelated Pod changes.
d66f8f9 added that "plugins have to implement a QueueingHint for Pod/Update
event if the rejection from them could be resolved by updating unscheduled
Pods itself".
This applies to DRA because the name of a generated ResourceClaim must be
recorded in the pod status before the pod can be scheduled.
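A hedged sketch of such a hint (simplified; the real hint also receives a
logger and returns an error):

    // Only a change in the generated ResourceClaim names recorded in the pod
    // status can resolve an earlier rejection by this plugin, so all other
    // updates of the unscheduled pod get skipped.
    func isSchedulableAfterPodChange(oldPod, newPod *v1.Pod) framework.QueueingHint {
        if apiequality.Semantic.DeepEqual(oldPod.Status.ResourceClaimStatuses, newPod.Status.ResourceClaimStatuses) {
            return framework.QueueSkip
        }
        return framework.Queue
    }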
Making unschedulable pods schedulable again after ResourceSlice cluster events
was accidentally left out when adding structured parameters to Kubernetes 1.30.
All E2E tests were defined so that a driver starts first. A new test with a
different order (create pod first, wait for unschedulable, start driver)
triggered the bug and now passes.
In the API, the effect of the feature gate is that alpha fields get dropped on
create. They get preserved during updates if already set. The
PodSchedulingContext registration is *not* restricted by the feature gate.
This enables deleting stale PodSchedulingContext objects after disabling
the feature gate.
The scheduler checks the new feature gate before setting up an informer for
PodSchedulingContext objects and when deciding whether it can schedule a
pod. If any claim depends on a control plane controller, the scheduler bails
out, leading to:
Status: Pending
...
Warning FailedScheduling 73s default-scheduler 0/1 nodes are available: resourceclaim depends on disabled DRAControlPlaneController feature. no new claims to deallocate, preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
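The check itself is small; roughly (a simplified sketch, the real code differs
in wording and placement):

    func checkControlPlaneController(claim *resourceapi.ResourceClaim) *framework.Status {
        // Claims that ask for a control plane controller cannot be scheduled
        // while the feature is disabled; the reason shows up in the
        // FailedScheduling event quoted above.
        if claim.Spec.Controller != "" &&
            !utilfeature.DefaultFeatureGate.Enabled(features.DRAControlPlaneController) {
            return framework.NewStatus(framework.UnschedulableAndUnresolvable,
                "resourceclaim depends on disabled DRAControlPlaneController feature")
        }
        return nil
    }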
The rest of the changes prepare for testing the new feature separately from
"structured parameters". The goal is to have base "dra" jobs which just enable
and test those, then "classic-dra" jobs which add DRAControlPlaneController.
The structured parameter allocation logic was written from scratch in
staging/src/k8s.io/dynamic-resource-allocation/structured where it might be
useful for out-of-tree components.
Besides the new features (amount, admin access) and API, it now supports
backtracking when the initial device selection doesn't lead to a complete
allocation of all claims.
Co-authored-by: Ed Bartosh <eduard.bartosh@intel.com>
Co-authored-by: John Belamaric <jbelamaric@google.com>
This is a complete revamp of the original API. Some of the key
differences:
- refocused on structured parameters and allocating devices
- support for constraints across devices
- support for allocating "all" or a fixed amount
of similar devices in a single request
- no class for ResourceClaims, instead individual
device requests are associated with a mandatory
DeviceClass
For the sake of simplicity, optional basic types (ints, strings) where the null
value is the default are represented as values in the API types. This makes Go
code simpler because it doesn't have to check for nil (consumers) and values
can be set directly (producers). The effect is that in protobuf, these fields
always get encoded because `opt` only has an effect for pointers.
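For illustration (field names made up), the difference is:

    // Value form, as used here: the zero value is the default, consumers read
    // it without nil checks, producers assign directly. In protobuf the field
    // always gets encoded because `opt` only matters for pointers.
    type ExampleSpec struct {
        Count int64 `json:"count,omitempty" protobuf:"varint,1,opt,name=count"`
    }

    // Pointer form, the rejected alternative: distinguishes "unset" from zero
    // but forces nil checks in all consumers.
    type ExampleSpecAlternative struct {
        Count *int64 `json:"count,omitempty" protobuf:"varint,1,opt,name=count"`
    }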
The roundtrip test data for v1.29.0 and v1.30.0 changes because of the new
"request" field. This is considered acceptable because the entire `claims`
field in the pod spec is still alpha.
The implementation is complete enough to bring up the apiserver.
Adapting other components follows.
As agreed in https://github.com/kubernetes/enhancements/pull/4709, immediate
allocation is one of those features which can be removed because it makes no
sense for structured parameters and the justification for classic DRA is weak.
This is in preparation for completely revamping the resource.k8s.io API. Because
there will be no support for transitioning from v1alpha2 to v1alpha3, the
roundtrip test data for that API in 1.29 and 1.30 gets removed.
Repeating the version in the import name of the API packages is not really
required. It was done for a while to support simpler grepping for usage of
alpha APIs, but there are better ways for that now. So during this transition,
"resourceapi" gets used instead of "resourcev1alpha3" and the version gets
dropped from informer and lister imports. The advantage is that the next bump
to v1beta1 will affect fewer source code lines.
Only source code where the version really matters (like API registration)
retains the versioned import.
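For example:

    import (
        resourceapi "k8s.io/api/resource/v1alpha3"
    )

A later bump to v1beta1 then only changes the import path, not every reference
to the package.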
This makes the API nicer:
resourceClaims:
- name: with-template
  resourceClaimTemplateName: test-inline-claim-template
- name: with-claim
  resourceClaimName: test-shared-claim
Previously, this was:
resourceClaims:
- name: with-template
  source:
    resourceClaimTemplateName: test-inline-claim-template
- name: with-claim
  source:
    resourceClaimName: test-shared-claim
A more long-term benefit is that other, future alternatives
might not make sense under the "source" umbrella.
This is a breaking change. It's justified because DRA is still
alpha and will have several other API breaks in 1.31.
The JSON patch approach works, but it is complex. A retry loop is easier to
understand (detect conflict, get new claim, try again). There is one additional
API call (the get), but in practice this scenario is unlikely.
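A minimal sketch of such a loop using client-go's standard helper (what exactly
gets mutated is left abstract):

    import (
        "context"

        resourceapi "k8s.io/api/resource/v1alpha3"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/util/retry"
    )

    func updateClaimWithRetry(ctx context.Context, client kubernetes.Interface, namespace, name string, mutate func(*resourceapi.ResourceClaim)) error {
        return retry.RetryOnConflict(retry.DefaultRetry, func() error {
            // Get the latest claim, apply the change, try to write it back.
            // RetryOnConflict repeats this only when the update fails with
            // a conflict.
            claim, err := client.ResourceV1alpha3().ResourceClaims(namespace).Get(ctx, name, metav1.GetOptions{})
            if err != nil {
                return err
            }
            mutate(claim)
            _, err = client.ResourceV1alpha3().ResourceClaims(namespace).Update(ctx, claim, metav1.UpdateOptions{})
            return err
        })
    }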
There was a race caused by having to update claim finalizer and status in two
different operations:
- Resource claim controller removes allocation, does not yet
get to remove the finalizer.
- Scheduler prepares an allocation, without adding the finalizer
because it is already there.
- Controller removes finalizer.
- Scheduler adds allocation.
This is an invalid state. Automatic checking found this during the execution of
the "with translated parameters on single node.*supports sharing a claim
sequentially" E2E test, but only when run stand-alone. When running in
parallel (as in the CI), the bad outcome of the race did not occur.
The fix is to check that the finalizer is still set when adding the
allocation. The apiserver doesn't check that because it doesn't know which
finalizer goes with the allocation result. It could check for "some finalizer",
but that is not guaranteed to be correct (could be some unrelated one).
Checking the finalizer can only be done with a JSON patch. Despite the
complications, having the ability to add multiple pods concurrently to
ReservedFor seems worth it (avoids expensive rescheduling or a local retry
loop).
The resource claim controller doesn't need this, it can do a normal update
which implicitly checks ResourceVersion.
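A hedged sketch of the patch (the finalizer value, its index in the list, and
the reservedFor entry are placeholders; the real code determines the
finalizer's actual position and writes the allocation the same way):

    import (
        "context"

        resourceapi "k8s.io/api/resource/v1alpha3"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/types"
        "k8s.io/client-go/kubernetes"
    )

    func addReservedFor(ctx context.Context, client kubernetes.Interface, namespace, name string) (*resourceapi.ResourceClaim, error) {
        // The "test" op fails if the finalizer is no longer where it is
        // expected, which makes the apiserver reject the whole patch,
        // including the ReservedFor addition.
        patch := []byte(`[
            {"op": "test", "path": "/metadata/finalizers/0", "value": "resource.kubernetes.io/delete-protection"},
            {"op": "add", "path": "/status/reservedFor/-", "value": {"resource": "pods", "name": "my-pod", "uid": "1234"}}
        ]`)
        return client.ResourceV1alpha3().ResourceClaims(namespace).Patch(ctx, name,
            types.JSONPatchType, patch, metav1.PatchOptions{}, "status")
    }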