GatherAllocatedState and ListAllAllocatedDevices need to collect information
from different sources (allocated devices, in-flight claims), potentially even
multiple times (GatherAllocatedState first gets allocated devices, then the
capacities).
The underlying assumption that nothing bad happens in parallel is not always
true. The following log snippet shows how an update of the assume
cache (feeding the allocated devices tracker) and in-flight claims lands such
that GatherAllocatedState doesn't see the device in that claim as allocated:
dra_manager.go:263: I0115 15:11:04.407714 18778] scheduler: Starting GatherAllocatedState
...
allocateddevices.go:189: I0115 15:11:04.407945 18066] scheduler: Observed device allocation device="testdra-all-usesallresources-hvs5d.driver/worker-5/worker-5-device-094" claim="testdra-all-usesallresources-hvs5d/claim-0553"
dynamicresources.go:1150: I0115 15:11:04.407981 89109] scheduler: Claim stored in assume cache pod="testdra-all-usesallresources-hvs5d/my-pod-0553" claim="testdra-all-usesallresources-hvs5d/claim-0553" uid=<types.UID>: a84d3c4d-f752-4cfd-8993-f4ce58643685 resourceVersion="5680"
dra_manager.go:201: I0115 15:11:04.408008 89109] scheduler: Removed in-flight claim claim="testdra-all-usesallresources-hvs5d/claim-0553" uid=<types.UID>: a84d3c4d-f752-4cfd-8993-f4ce58643685 version="1211"
dynamicresources.go:1157: I0115 15:11:04.408044 89109] scheduler: Removed claim from in-flight claims pod="testdra-all-usesallresources-hvs5d/my-pod-0553" claim="testdra-all-usesallresources-hvs5d/claim-0553" uid=<types.UID>: a84d3c4d-f752-4cfd-8993-f4ce58643685 resourceVersion="5680" allocation=<
{
"devices": {
"results": [
{
"request": "req-1",
"driver": "testdra-all-usesallresources-hvs5d.driver",
"pool": "worker-5",
"device": "worker-5-device-094"
}
]
},
"nodeSelector": {
"nodeSelectorTerms": [
{
"matchFields": [
{
"key": "metadata.name",
"operator": "In",
"values": [
"worker-5"
]
}
]
}
]
},
"allocationTimestamp": "2026-01-15T14:11:04Z"
}
>
dra_manager.go:280: I0115 15:11:04.408085 18778] scheduler: Device is in flight for allocation device="testdra-all-usesallresources-hvs5d.driver/worker-5/worker-5-device-095" claim="testdra-all-usesallresources-hvs5d/claim-0086"
dra_manager.go:280: I0115 15:11:04.408137 18778] scheduler: Device is in flight for allocation device="testdra-all-usesallresources-hvs5d.driver/worker-5/worker-5-device-096" claim="testdra-all-usesallresources-hvs5d/claim-0165"
default_binder.go:69: I0115 15:11:04.408175 89109] scheduler: Attempting to bind pod to node pod="testdra-all-usesallresources-hvs5d/my-pod-0553" node="worker-5"
dra_manager.go:265: I0115 15:11:04.408264 18778] scheduler: Finished GatherAllocatedState allocatedDevices=<map[string]interface {} | len:2>: {
Initial state: "worker-5-device-094" is in-flight, not in cache
- goroutine #1: starts GatherAllocatedState, copies cache
- goroutine #2: adds to assume cache, removes from in-flight
- goroutine #1: checks in-flight
=> device never seen as allocated
This is the second reason for double allocation of the same device in two
different claims. The other was timing in the assume cache. Both were
tracked down with an integration test (separate commit). It did not fail
all the time, but enough that regressions should show up as flakes.
In kubelet admission:
- Remove extended resources from pod requirements if they are either
backed by DRA or not present in node's allocatable resources
In scheduler (fit.go):
- Remove fallback logic that delegated all resources to DRA when
draManager is nil
These changes ensure that:
- DRA-backed extended resources are properly handled during pod admission
- DevicePlugin-backed extended resources still follow standard admission rules
Test_isSchedulableAfterClaimChange was sensitive to system load because of the
arbitrary delay when waiting for the assume cache to catch up. Running inside
a synctest bubble avoids this. While at it, the unit tests get converted
to ktesting (nicer failure output, no extra indention needed for
tCtx.SyncTest).
TestPlugin/prebind-fail-with-binding-timeout relied on setting up a claim with
certain time stamps and then getting that test case tested within a certain
real-world time window. It's surprising that this didn't flake more often
because test execution order is random. Now the time stamp gets set right
before the test case is about to be tested. Conversion to a synctest would
be nicer, but synctests cannot have sub-tests, which are used here to track
where log output and failures come from within the larger test case.
Inside the plugin itself some log output gets added to explain why a claim is
unavailable on a node in case of a binding timeout or error during Filter.
The tests assumed that instantiating a DRAManager followed by
informerFactory.WaitForCacheSync would be enough to have the manager
up-to-date, but that's not correct: the test only waits for informer *caches*
to be synced, but syncing *event handlers* like the one in the manager may
still be going on. The flake rate is low, though:
$ GOPATH/bin/stress -p 256 ./noderesources.test
5s: 0 runs so far, 0 failures, 256 active
10s: 256 runs so far, 0 failures, 256 active
15s: 256 runs so far, 0 failures, 256 active
20s: 512 runs so far, 0 failures, 256 active
25s: 567 runs so far, 0 failures, 256 active
30s: 771 runs so far, 0 failures, 256 active
/tmp/go-stress-20251226T181044-974980161
--- FAIL: TestCalculateResourceAllocatableRequest (0.81s)
--- FAIL: TestCalculateResourceAllocatableRequest/DRA-backed-resource-with-shared-device-allocation (0.00s)
extendedresourcecache.go:197: I1226 18:11:14.431337] Updated extended resource cache for explicit mapping extendedResource="extended.resource.dra.io/something" deviceClass="device-class-name"
extendedresourcecache.go:204: I1226 18:11:14.431380] Updated extended resource cache for default mapping extendedResource="deviceclass.resource.kubernetes.io/device-class-name" deviceClass="device-class-name"
extendedresourcecache.go:220: I1226 18:11:14.431394] Updated device class mapping deviceClass="device-class-name" extendedResource="extended.resource.dra.io/something"
resource_allocation_test.go:595: Expected requested=2, but got requested=1
FAIL
It becomes higher when changing WaitForCacheSync such that it doesn't poll and
therefore returns more promptly, which is where this flake was first observed.
The fix is to run the test in a syntest bubble where Wait can be used to wait
for all background activity, including event handling, to be finished before
proceeding with the test.
synctest is less forgiving about lingering goroutines. A synctest bubble must
wait for gouroutines to stop, which in this case means that there has to be
a way to wait for the metric recorder shutdown. Event handlers have to be
removed.
This could be done with plain Go, but here test/utils/ktesting is used instead
because it offers some advantages:
- less boilerplate code
- automatic cancellation of the context (i.e. less manual context.WithCancel)
- tCtx.SyncTest is a direct substitute for t.Run, which avoids re-indenting
sub-tests. synctest itself needs another anonymous function, which makes
the line too long and forced re-indention:
t.Run(... func(...) {
synctest.Test(... func() {
})
})
For the sake of consistency all tests get updated.
While at it, some code gets improved:
- t.Fatal(err) is not a good way to report an error because
there is no additional markup in the test output that indicates
that there was an unexpected error. It just logs err.Error(),
which might not be very informative and/or obvious.
- newTestDRAManager aborts in case of a failure instead of
returning an error.
This has been replaced by `//build:...` for a long time now.
Removal of the old build tag was automated with:
for i in $(git grep -l '^// +build' | grep -v -e '^vendor/'); do if ! grep -q '^// Code generated' "$i"; then sed -i -e '/^\/\/ +build/d' "$i"; fi; done
For debugging double allocation of the same
device (https://github.com/kubernetes/kubernetes/issues/133602) it is necessary
to have information about pools, devices and in-flight claims. Log calls get
extended and the config for DRA CI jobs updated to enable higher verbosity for
relevant source files.
Log output in such a cluster at verbosity 6 looks like this:
I1215 10:28:54.166872 1 allocator_incubating.go:130] "Gathered pool information" logger="FilterWithNominatedPods.Filter.DynamicResources" pod="dra-8841/tester-3" node="kind-worker2" pools={"count":1,"devices":["dra-8841.k8s.io/kind-worker2/device-00"],"meta":[{"InvalidReason":"","id":"dra-8841.k8s.io/kind-worker2","isIncomplete":false,"isInvalid":false}]}
I1215 10:28:54.166941 1 allocator_incubating.go:254] "Gathered information about devices" logger="FilterWithNominatedPods.Filter.DynamicResources" pod="dra-8841/tester-3" node="kind-worker2" allocatedDevices={"count":2,"devices":["dra-8841.k8s.io/kind-worker/device-00","dra-8841.k8s.io/kind-worker3/device-00"]} minDevicesToBeAllocated=1
Add counter metric to track pods that schedule immediately after
being flushed from unschedulablePods due to timeout. Uses a boolean
flag that is cleared when pods return to queue or move via events.
This metric tracks pods that successfully schedule after being
flushed from unschedulablePods due to timeout. High values may
indicate missing queue hint optimizations or event handling issues.
The `removeSlice` function was leaving behind references to the
removed element, preventing it from being garbage-collected.
This commit ensures that removed entries are fully cleared,
eliminating the memory leak.
Co-authored-by: ravisastryk <ravisastryk@gmail.com>
Signed-off-by: Sujal Shah <sujalshah28092004@gmail.com>
* First version of batching w/out signatures.
* First version of pod signatures.
* Integrate batching with signatures.
* Fix merge conflicts.
* Fixes from self-review.
* Test fixes.
* Fix a bug that limited batches to size 2
Also add some new high-level logging and
simplify the pod affinity signature.
* Re-enable batching on perf tests for now.
* fwk.NewStatus(fwk.Success)
* Review feedback.
* Review feedback.
* Comment fix.
* Two plugin specific unit tests.:
* Add cycle state to the sign call, apply to topo spread.
Also add unit tests for several plugi signature
calls.
* Review feedback.
* Switch to distinct stats for hint and store calls.
* Switch signature from string to []byte
* Revert cyclestate in signs. Update node affinity.
Node affinity now sorts all of the various
nested arrays in the structure. CycleState no
longer in signature; revert to signing fewer
cases for pod spread.
* hack/update-vendor.sh
* Disable signatures when extenders are configured.
* Update pkg/scheduler/framework/runtime/batch.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
* Update staging/src/k8s.io/kube-scheduler/framework/interface.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
* Review feedback.
* Disable node resource signatures when extended DRA enabled.
* Review feedback.
* Update pkg/scheduler/framework/plugins/imagelocality/image_locality.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
* Update pkg/scheduler/framework/interface.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
* Update pkg/scheduler/framework/plugins/nodedeclaredfeatures/nodedeclaredfeatures.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
* Update pkg/scheduler/framework/runtime/batch.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
* Review feedback.
* Fixes for review suggestions.
* Add integration tests.
* Linter fixes, test fix.
* Whitespace fix.
* Remove broken test.
* Unschedulable test.
* Remove go.mod changes.
---------
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>