kubernetes/pkg
Patrick Ohly eb7391688e DRA scheduler: fix another root cause of double device allocation
GatherAllocatedState and ListAllAllocatedDevices have to collect information
from different sources (allocated devices, in-flight claims), potentially even
in multiple passes (GatherAllocatedState first reads the allocated devices,
then the capacities).

The implicit assumption that nothing changes concurrently does not always
hold. The following log snippet shows how an update of the assume
cache (which feeds the allocated devices tracker) and of the in-flight claims
can land in the middle of GatherAllocatedState, so that the device in that
claim is never seen as allocated:

    dra_manager.go:263: I0115 15:11:04.407714      18778] scheduler: Starting GatherAllocatedState
    ...
    allocateddevices.go:189: I0115 15:11:04.407945      18066] scheduler: Observed device allocation device="testdra-all-usesallresources-hvs5d.driver/worker-5/worker-5-device-094" claim="testdra-all-usesallresources-hvs5d/claim-0553"
    dynamicresources.go:1150: I0115 15:11:04.407981      89109] scheduler: Claim stored in assume cache pod="testdra-all-usesallresources-hvs5d/my-pod-0553" claim="testdra-all-usesallresources-hvs5d/claim-0553" uid=<types.UID>: a84d3c4d-f752-4cfd-8993-f4ce58643685 resourceVersion="5680"
    dra_manager.go:201: I0115 15:11:04.408008      89109] scheduler: Removed in-flight claim claim="testdra-all-usesallresources-hvs5d/claim-0553" uid=<types.UID>: a84d3c4d-f752-4cfd-8993-f4ce58643685 version="1211"
    dynamicresources.go:1157: I0115 15:11:04.408044      89109] scheduler: Removed claim from in-flight claims pod="testdra-all-usesallresources-hvs5d/my-pod-0553" claim="testdra-all-usesallresources-hvs5d/claim-0553" uid=<types.UID>: a84d3c4d-f752-4cfd-8993-f4ce58643685 resourceVersion="5680" allocation=<
        	{
        	  "devices": {
        	    "results": [
        	      {
        	        "request": "req-1",
        	        "driver": "testdra-all-usesallresources-hvs5d.driver",
        	        "pool": "worker-5",
        	        "device": "worker-5-device-094"
        	      }
        	    ]
        	  },
        	  "nodeSelector": {
        	    "nodeSelectorTerms": [
        	      {
        	        "matchFields": [
        	          {
        	            "key": "metadata.name",
        	            "operator": "In",
        	            "values": [
        	              "worker-5"
        	            ]
        	          }
        	        ]
        	      }
        	    ]
        	  },
        	  "allocationTimestamp": "2026-01-15T14:11:04Z"
        	}
         >
    dra_manager.go:280: I0115 15:11:04.408085      18778] scheduler: Device is in flight for allocation device="testdra-all-usesallresources-hvs5d.driver/worker-5/worker-5-device-095" claim="testdra-all-usesallresources-hvs5d/claim-0086"
    dra_manager.go:280: I0115 15:11:04.408137      18778] scheduler: Device is in flight for allocation device="testdra-all-usesallresources-hvs5d.driver/worker-5/worker-5-device-096" claim="testdra-all-usesallresources-hvs5d/claim-0165"
    default_binder.go:69: I0115 15:11:04.408175      89109] scheduler: Attempting to bind pod to node pod="testdra-all-usesallresources-hvs5d/my-pod-0553" node="worker-5"
    dra_manager.go:265: I0115 15:11:04.408264      18778] scheduler: Finished GatherAllocatedState allocatedDevices=<map[string]interface {} | len:2>: {

Initial state: "worker-5-device-094" is in-flight, not in the cache
- goroutine #1: starts GatherAllocatedState, copies the cache
- goroutine #2: adds the claim to the assume cache, removes it from the in-flight claims
- goroutine #1: checks the in-flight claims

=> the device is never seen as allocated (see the sketch below)
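
The following is a minimal, self-contained Go sketch of that check-then-act
race, using purely illustrative names (tracker, gatherAllocatedState,
betweenReads) rather than the real scheduler types: the two sources are read
in two separate critical sections, so an update that lands between the reads
makes the device invisible to both checks.

    package main

    import (
    	"fmt"
    	"sync"
    )

    // tracker is a toy stand-in for the scheduler's allocated-device tracking.
    type tracker struct {
    	mu        sync.Mutex
    	allocated map[string]bool // devices recorded in the assume cache
    	inFlight  map[string]bool // devices held by in-flight claims

    	// betweenReads is a test hook that forces the interleaving above.
    	betweenReads func()
    }

    // gatherAllocatedState reads the two sources in separate critical sections.
    // A device moved from inFlight to allocated between the reads is seen in
    // neither, which is the race shown in the log snippet.
    func (t *tracker) gatherAllocatedState(dev string) bool {
    	t.mu.Lock()
    	fromCache := t.allocated[dev] // goroutine #1: copies the cache state
    	t.mu.Unlock()

    	if t.betweenReads != nil {
    		t.betweenReads() // goroutine #2 runs here in the failing schedule
    	}

    	t.mu.Lock()
    	fromInFlight := t.inFlight[dev] // goroutine #1: checks in-flight claims
    	t.mu.Unlock()

    	return fromCache || fromInFlight
    }

    func main() {
    	const dev = "worker-5-device-094"
    	t := &tracker{
    		allocated: map[string]bool{},
    		inFlight:  map[string]bool{dev: true},
    	}
    	t.betweenReads = func() {
    		// goroutine #2: the claim lands in the assume cache and is
    		// removed from the in-flight claims.
    		t.mu.Lock()
    		t.allocated[dev] = true
    		delete(t.inFlight, dev)
    		t.mu.Unlock()
    	}
    	// Prints "false": the device is never reported as allocated, so a
    	// second claim could be handed the same device.
    	fmt.Println(t.gatherAllocatedState(dev))
    }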

This is the second root cause of the same device being allocated to two
different claims; the other was timing in the assume cache. Both were
tracked down with an integration test (separate commit). The test did not
fail every time, but often enough that regressions should show up as flakes.
2026-01-27 14:34:56 +01:00
..
api fix(pod/util): typos in getting pod validation options 2025-02-28 22:19:00 +05:30
apis fix: allow job startTime updates on resume from suspended state 2025-11-05 09:52:53 +01:00
auth wire in ctx to rbac plugins 2024-09-17 20:04:02 +03:00
capabilities Add ut coverage for capabilities.Setup (#125395) 2024-10-17 18:23:03 +01:00
client Add test to confirm default content type used by core client 2024-10-23 11:35:32 -04:00
cluster/ports
controller mark QuotaMonitor as not running and invalidate monitors list 2026-01-08 13:49:59 +01:00
controlplane test: Add emulated-version flag verification in flagz test 2025-02-20 18:54:51 -08:00
credentialprovider credential provider config: detect typos 2024-10-14 12:23:43 -07:00
features Add the feature gate OrderedNamespaceDeletion for apiserver. 2025-03-03 13:40:33 -08:00
fieldpath
generated kubelet: use env vars in node log query PS command 2025-01-13 14:25:35 -08:00
kubeapiserver v1alpha2 LeaseCandidate API 2024-11-08 02:27:19 +00:00
kubectl DRA: bump API v1alpha2 -> v1alpha3 2024-07-21 17:28:13 +02:00
kubelet mark device manager as healthy before it started for the first time 2025-11-07 03:06:43 +00:00
kubemark remove runonce mode 2024-11-07 19:54:11 +08:00
printers v1alpha2 LeaseCandidate API 2024-11-08 02:27:19 +00:00
probe fix: enable nil-compare and error-nil rules from testifylint in module k8s.io/kubernetes 2024-09-25 06:02:47 +02:00
proxy kube-proxy/winkernel: fix stale RemoteEndpoints due to premature clearing of terminatedEndpoints map. 2025-11-06 07:59:08 +00:00
quota/v1 Merge pull request #128407 from ndixita/pod-level-resources 2024-11-08 07:10:50 +00:00
registry fix: allow job startTime updates on resume from suspended state 2025-11-05 09:52:53 +01:00
routes Move public key getter to interface 2024-06-25 18:10:08 -04:00
scheduler DRA scheduler: fix another root cause of double device allocation 2026-01-27 14:34:56 +01:00
security Copy limited pieces of code we use from runc's apparmor and utils packages 2024-10-22 09:56:22 -04:00
securitycontext Mask Linux thermal interrupt info in /proc and /sys. 2025-07-16 11:07:17 +02:00
serviceaccount Isolate mock signer for externaljwt tests 2024-12-12 09:32:11 -05:00
util Revert "Enforce the Minimum Kernel Version 6.3 for UserNamespacesSupport feature" 2025-05-15 12:26:08 +02:00
volume Merge pull request #135066 from eltrufas/automated-cherry-pick-of-#133599-upstream-release-1.32 2025-11-19 23:32:02 -08:00
windows/service Windows node graceful shutdown 2024-11-05 17:46:22 +00:00
.import-restrictions
OWNERS