Commit graph

2003 commits

Author SHA1 Message Date
Ania Borowiec
48c4605408
Add logging error when UpdatePod finds no existing PodGroup with the pod to update 2026-01-27 11:42:03 +00:00
Kubernetes Prow Robot
efc15394a1
Merge pull request #135573 from brejman/issue-129733-score-update
Update scoring function for balanced allocation to consider change to the node's balance
2026-01-26 21:49:52 +05:30
Kubernetes Prow Robot
53b29a3a2c
Merge pull request #136269 from pohly/dra-scheduler-double-allocation-fixes
DRA scheduler: double allocation fixes
2026-01-26 20:59:50 +05:30
Patrick Ohly
581ee0a2ec DRA scheduler: fix another root cause of double device allocation
GatherAllocatedState and ListAllAllocatedDevices need to collect information
from different sources (allocated devices, in-flight claims), potentially even
multiple times (GatherAllocatedState first gets allocated devices, then the
capacities).

The underlying assumption that nothing bad happens in parallel is not always
true. The following log snippet shows how an update of the assume
cache (feeding the allocated devices tracker) and in-flight claims lands such
that GatherAllocatedState doesn't see the device in that claim as allocated:

    dra_manager.go:263: I0115 15:11:04.407714      18778] scheduler: Starting GatherAllocatedState
    ...
    allocateddevices.go:189: I0115 15:11:04.407945      18066] scheduler: Observed device allocation device="testdra-all-usesallresources-hvs5d.driver/worker-5/worker-5-device-094" claim="testdra-all-usesallresources-hvs5d/claim-0553"
    dynamicresources.go:1150: I0115 15:11:04.407981      89109] scheduler: Claim stored in assume cache pod="testdra-all-usesallresources-hvs5d/my-pod-0553" claim="testdra-all-usesallresources-hvs5d/claim-0553" uid=<types.UID>: a84d3c4d-f752-4cfd-8993-f4ce58643685 resourceVersion="5680"
    dra_manager.go:201: I0115 15:11:04.408008      89109] scheduler: Removed in-flight claim claim="testdra-all-usesallresources-hvs5d/claim-0553" uid=<types.UID>: a84d3c4d-f752-4cfd-8993-f4ce58643685 version="1211"
    dynamicresources.go:1157: I0115 15:11:04.408044      89109] scheduler: Removed claim from in-flight claims pod="testdra-all-usesallresources-hvs5d/my-pod-0553" claim="testdra-all-usesallresources-hvs5d/claim-0553" uid=<types.UID>: a84d3c4d-f752-4cfd-8993-f4ce58643685 resourceVersion="5680" allocation=<
        	{
        	  "devices": {
        	    "results": [
        	      {
        	        "request": "req-1",
        	        "driver": "testdra-all-usesallresources-hvs5d.driver",
        	        "pool": "worker-5",
        	        "device": "worker-5-device-094"
        	      }
        	    ]
        	  },
        	  "nodeSelector": {
        	    "nodeSelectorTerms": [
        	      {
        	        "matchFields": [
        	          {
        	            "key": "metadata.name",
        	            "operator": "In",
        	            "values": [
        	              "worker-5"
        	            ]
        	          }
        	        ]
        	      }
        	    ]
        	  },
        	  "allocationTimestamp": "2026-01-15T14:11:04Z"
        	}
         >
    dra_manager.go:280: I0115 15:11:04.408085      18778] scheduler: Device is in flight for allocation device="testdra-all-usesallresources-hvs5d.driver/worker-5/worker-5-device-095" claim="testdra-all-usesallresources-hvs5d/claim-0086"
    dra_manager.go:280: I0115 15:11:04.408137      18778] scheduler: Device is in flight for allocation device="testdra-all-usesallresources-hvs5d.driver/worker-5/worker-5-device-096" claim="testdra-all-usesallresources-hvs5d/claim-0165"
    default_binder.go:69: I0115 15:11:04.408175      89109] scheduler: Attempting to bind pod to node pod="testdra-all-usesallresources-hvs5d/my-pod-0553" node="worker-5"
    dra_manager.go:265: I0115 15:11:04.408264      18778] scheduler: Finished GatherAllocatedState allocatedDevices=<map[string]interface {} | len:2>: {

Initial state: "worker-5-device-094" is in-flight, not in cache
- goroutine #1: starts GatherAllocatedState, copies cache
- goroutine #2: adds to assume cache, removes from in-flight
- goroutine #1: checks in-flight

=> device never seen as allocated

This is the second reason for double allocation of the same device in two
different claims. The other was timing in the assume cache. Both were
tracked down with an integration test (separate commit). It did not fail
all the time, but enough that regressions should show up as flakes.
2026-01-26 15:44:48 +01:00
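The interleaving described in commit 581ee0a2ec can be reproduced deterministically in a few lines of Go. This is only a sketch under assumptions: the `tracker` type, its fields, and the `promote`, `gatherRacy`, and `gatherAtomic` functions are hypothetical stand-ins, not the scheduler's real types, and holding a single lock across both reads is one possible remedy, not necessarily the one the commit applies.

```go
package main

import (
	"fmt"
	"sync"
)

// device is the device name taken from the log snippet in the commit message.
const device = "worker-5-device-094"

// tracker is a hypothetical stand-in for the two sources of truth:
// devices already recorded as allocated and devices of in-flight claims.
type tracker struct {
	mu        sync.Mutex
	allocated map[string]bool
	inFlight  map[string]bool
}

// newTracker returns the initial state from the commit message:
// the device is in flight and not yet in the cache.
func newTracker() *tracker {
	return &tracker{
		allocated: map[string]bool{},
		inFlight:  map[string]bool{device: true},
	}
}

// promote mimics goroutine #2: add to the cache, remove from in-flight.
func (t *tracker) promote(dev string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.allocated[dev] = true
	delete(t.inFlight, dev)
}

// gatherRacy reads the two sources in separate critical sections.
// The between callback runs in the gap and simulates a concurrent promote.
func (t *tracker) gatherRacy(between func()) map[string]bool {
	seen := map[string]bool{}
	t.mu.Lock()
	for d := range t.allocated {
		seen[d] = true
	}
	t.mu.Unlock()

	between() // goroutine #2 lands here

	t.mu.Lock()
	for d := range t.inFlight {
		seen[d] = true
	}
	t.mu.Unlock()
	return seen
}

// gatherAtomic reads both sources under one lock, so a promote can only
// happen entirely before or entirely after the snapshot.
func (t *tracker) gatherAtomic() map[string]bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	seen := map[string]bool{}
	for d := range t.allocated {
		seen[d] = true
	}
	for d := range t.inFlight {
		seen[d] = true
	}
	return seen
}

func main() {
	t := newTracker()
	racy := t.gatherRacy(func() { t.promote(device) })
	fmt.Println("racy gather sees device:", racy[device]) // false

	t = newTracker()
	before := t.gatherAtomic()
	t.promote(device)
	after := t.gatherAtomic()
	fmt.Println("atomic gather sees device:", before[device], after[device]) // true true
}
```

In the racy variant the device is missing from both reads, which is the "device never seen as allocated" outcome from the commit message. A consistent snapshot reports it regardless of when the promotion happens.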
Kubernetes Prow Robot
584add12b6
Merge pull request #136457 from tosi3k/workload-helper
Extract helper methods from gang scheduling plugin
2026-01-26 20:01:51 +05:30
Bartosz
56ca09911f
Refactor resource allocation tests to be more readable 2026-01-26 14:26:46 +00:00
Bartosz
8f5f69bc70
Change scoring function for balanced allocation 2026-01-26 14:22:46 +00:00
Antoni Zawodny
8b39544d60 Extract helper methods from gang scheduling plugin 2026-01-26 13:45:26 +01:00
Kubernetes Prow Robot
0af247eb14
Merge pull request #136344 from brejman/kep-5732-tas-rename-podgroupinfo
Rename PodGroupInfo in preparation for Workload-aware scheduling changes
2026-01-23 17:37:29 +05:30
Bartosz
ae27a49a13
Rename PodGroupInfo to PodGroupState
This is in preparation for PodGroupInfo struct with more pod group
details
2026-01-22 14:45:40 +00:00
Kubernetes Prow Robot
cb077823fb
Merge pull request #136204 from romanbaron/remove-cache-ttl
Remove cache expiration mechanism
2026-01-20 03:42:50 +05:30
carlory
c8fc0a1b98 remove CSIMigrationPortworx and InTreePluginPortworxUnregister feature gates
Signed-off-by: carlory <baofa.fan@daocloud.io>
2026-01-19 11:35:29 +08:00
Roman Baron
74b7ff3c63 scheduler: Remove ttl parameter from cache.New signature 2026-01-13 17:04:06 +02:00
Antoni Zawodny
833b7205fc Run PreBind plugins in parallel if feasible 2026-01-11 14:19:18 +01:00
Antoni Zawodny
16b375e4ef Generalize ErrorChannel to other underlying types 2026-01-11 13:58:06 +01:00
Kubernetes Prow Robot
b54554b72d
Merge pull request #135955 from utam0k/async-metrics
scheduler: align the meaning of victim metrics between async preemption and sync preemption
2026-01-08 20:39:41 +05:30
utam0k
44e0c79406
Align the meaning of victim metrics between async preemption and sync preemption
Signed-off-by: utam0k <k0ma@utam0k.jp>
2026-01-08 21:02:17 +09:00
Kubernetes Prow Robot
8ab1bc1633
Merge pull request #135725 from bart0sh/PR211-add-extended-resources-test-cases
Fix extended resource handling for DRA-backed resources on pod admission
2026-01-08 04:03:42 +05:30
Kubernetes Prow Robot
4e69edd0ee
Merge pull request #135392 from brejman/issue-134393-nominated-nodes
Fix queue hint for plugins on change to pods with nominated nodes
2026-01-07 20:05:38 +05:30
Kubernetes Prow Robot
b2ac9e206f
Merge pull request #130231 from Barakmor1/updateimagelocality
Update ImageLocality plugin to account ImageVolume images
2026-01-05 12:28:37 +05:30
Ed Bartosh
c2361491f5 Fix extended resource handling for DRA-backed resources
In kubelet admission:
   - Remove extended resources from pod requirements if they are either
     backed by DRA or not present in node's allocatable resources

In scheduler (fit.go):
   - Remove fallback logic that delegated all resources to DRA when
     draManager is nil

These changes ensure that:
- DRA-backed extended resources are properly handled during pod admission
- DevicePlugin-backed extended resources still follow standard admission rules
2026-01-02 16:08:49 +02:00
Patrick Ohly
dfa6aa22b2 DRA scheduler: fix unit test flakes
Test_isSchedulableAfterClaimChange was sensitive to system load because of the
arbitrary delay when waiting for the assume cache to catch up. Running inside
a synctest bubble avoids this. While at it, the unit tests get converted
to ktesting (nicer failure output, no extra indentation needed for
tCtx.SyncTest).

TestPlugin/prebind-fail-with-binding-timeout relied on setting up a claim with
certain time stamps and then getting that test case tested within a certain
real-world time window. It's surprising that this didn't flake more often
because test execution order is random. Now the time stamp gets set right
before the test case is about to be tested. Conversion to a synctest would
be nicer, but synctests cannot have sub-tests, which are used here to track
where log output and failures come from within the larger test case.

Inside the plugin itself some log output gets added to explain why a claim is
unavailable on a node in case of a binding timeout or error during Filter.
2025-12-30 11:45:02 +01:00
Kubernetes Prow Robot
3226fe520d
Merge pull request #135948 from pohly/dra-scheduler-resource-plugin-unit-test-fix
DRA extended resources: fix flake in unit tests
2025-12-30 16:12:35 +05:30
Kubernetes Prow Robot
2a3a6605ac
Merge pull request #135330 from sujalshah-bit/fix-mem-leak
scheduler: Fix memory leak in scheduler cache
2025-12-29 15:56:34 +05:30
Patrick Ohly
7a4d650125 DRA extended resources: fix flake in unit tests
The tests assumed that instantiating a DRAManager followed by
informerFactory.WaitForCacheSync would be enough to have the manager
up-to-date, but that's not correct: the test only waits for informer *caches*
to be synced, but syncing *event handlers* like the one in the manager may
still be going on. The flake rate is low, though:

    $ GOPATH/bin/stress -p 256 ./noderesources.test
    5s: 0 runs so far, 0 failures, 256 active
    10s: 256 runs so far, 0 failures, 256 active
    15s: 256 runs so far, 0 failures, 256 active
    20s: 512 runs so far, 0 failures, 256 active
    25s: 567 runs so far, 0 failures, 256 active
    30s: 771 runs so far, 0 failures, 256 active

    /tmp/go-stress-20251226T181044-974980161
    --- FAIL: TestCalculateResourceAllocatableRequest (0.81s)
        --- FAIL: TestCalculateResourceAllocatableRequest/DRA-backed-resource-with-shared-device-allocation (0.00s)
            extendedresourcecache.go:197: I1226 18:11:14.431337] Updated extended resource cache for explicit mapping extendedResource="extended.resource.dra.io/something" deviceClass="device-class-name"
            extendedresourcecache.go:204: I1226 18:11:14.431380] Updated extended resource cache for default mapping extendedResource="deviceclass.resource.kubernetes.io/device-class-name" deviceClass="device-class-name"
            extendedresourcecache.go:220: I1226 18:11:14.431394] Updated device class mapping deviceClass="device-class-name" extendedResource="extended.resource.dra.io/something"
            resource_allocation_test.go:595: Expected requested=2, but got requested=1
    FAIL

It becomes higher when changing WaitForCacheSync such that it doesn't poll and
therefore returns more promptly, which is where this flake was first observed.

The fix is to run the test in a synctest bubble where Wait can be used to wait
for all background activity, including event handling, to be finished before
proceeding with the test.

synctest is less forgiving about lingering goroutines. A synctest bubble must
wait for goroutines to stop, which in this case means that there has to be
a way to wait for the metric recorder shutdown. Event handlers have to be
removed.

This could be done with plain Go, but here test/utils/ktesting is used instead
because it offers some advantages:
- less boilerplate code
- automatic cancellation of the context (i.e. less manual context.WithCancel)
- tCtx.SyncTest is a direct substitute for t.Run, which avoids re-indenting
  sub-tests. synctest itself needs another anonymous function, which makes
  the line too long and forces re-indentation:
     t.Run(... func(...) {
         synctest.Test(... func() {
         })
     })

For the sake of consistency all tests get updated.

While at it, some code gets improved:

- t.Fatal(err) is not a good way to report an error because
  there is no additional markup in the test output that indicates
  that there was an unexpected error. It just logs err.Error(),
  which might not be very informative and/or obvious.
- newTestDRAManager aborts in case of a failure instead of
  returning an error.
2025-12-27 09:47:56 +01:00
Bartosz
3b4f0be6e3
Check NominatedNodeName to decide if a pod is scheduled 2025-12-19 12:30:06 +00:00
Patrick Ohly
ad79e479c2 build: remove deprecated '// +build' tag
This has been replaced by `//go:build ...` for a long time now.

Removal of the old build tag was automated with:

    for i in $(git grep -l '^// +build' | grep -v -e '^vendor/'); do if ! grep -q '^// Code generated' "$i"; then sed -i -e '/^\/\/ +build/d' "$i"; fi; done
2025-12-18 12:16:21 +01:00
Kubernetes Prow Robot
a504b1b4eb
Merge pull request #135755 from pohly/dra-logging
DRA: log more information
2025-12-18 02:10:38 -08:00
bmordeha
6f57f1e95b Update imageLocality plugin
to account for ImageVolume images when scoring
and prioritizing nodes with required pod images

Signed-off-by: bmordeha <bmordeha@redhat.com>
2025-12-18 09:28:39 +02:00
Kubernetes Prow Robot
4a1cbabadd
Merge pull request #135495 from tosi3k/skip-last-pod-deletion
Skip last victim in async preemption if any prior Pod preemption failed
2025-12-17 22:36:28 -08:00
Kubernetes Prow Robot
62db4db266
Merge pull request #135489 from ania-borowiec/update_comment
Update async preemption comment to reflect the current state of the code
2025-12-17 22:36:13 -08:00
Kubernetes Prow Robot
c5a0c31294
Merge pull request #135484 from bart0sh/PR209-improve-balanced-allocation-coverage
Extended resources unit tests: cover DRA resources
2025-12-17 22:36:06 -08:00
Kubernetes Prow Robot
1a3d8712f3
Merge pull request #135394 from brejman/adhoc-interpodaffinity-pending-pod-update
Fix queue hint for interpodaffinity when target pod is updated
2025-12-17 21:42:46 -08:00
Kubernetes Prow Robot
285eb9fdba
Merge pull request #135325 from brejman/issue-134393
Fix queue hint for inter-pod anti-affinity
2025-12-17 20:01:02 -08:00
Bartosz
d6d8639349
Fix queue hint for interpod antiaffinity 2025-12-16 13:01:15 +00:00
Bartosz
145adcd522
Fix queue hint for interpodaffinity when target pod is updated 2025-12-16 12:57:50 +00:00
Patrick Ohly
5d536bfb8e DRA: log more information
For debugging double allocation of the same
device (https://github.com/kubernetes/kubernetes/issues/133602) it is necessary
to have information about pools, devices and in-flight claims. Log calls get
extended and the config for DRA CI jobs updated to enable higher verbosity for
relevant source files.

Log output in such a cluster at verbosity 6 looks like this:

I1215 10:28:54.166872       1 allocator_incubating.go:130] "Gathered pool information" logger="FilterWithNominatedPods.Filter.DynamicResources" pod="dra-8841/tester-3" node="kind-worker2" pools={"count":1,"devices":["dra-8841.k8s.io/kind-worker2/device-00"],"meta":[{"InvalidReason":"","id":"dra-8841.k8s.io/kind-worker2","isIncomplete":false,"isInvalid":false}]}
I1215 10:28:54.166941       1 allocator_incubating.go:254] "Gathered information about devices" logger="FilterWithNominatedPods.Filter.DynamicResources" pod="dra-8841/tester-3" node="kind-worker2" allocatedDevices={"count":2,"devices":["dra-8841.k8s.io/kind-worker/device-00","dra-8841.k8s.io/kind-worker3/device-00"]} minDevicesToBeAllocated=1
2025-12-16 09:58:05 +01:00
Ed Bartosh
1820dc7535 Fit tests: add DRA-aware test cases 2025-12-12 15:48:18 +02:00
Ed Bartosh
7860effc2c resourceAllocationScorer: add unit test for DRA nodeMatches 2025-12-12 15:48:13 +02:00
Ed Bartosh
02a39d6c1e Balanced allocation tests: cover DRA resources
- Added DRA-aware test cases
- Pulled shared DRA setup out into helper to keep tests DRY
- Added SignPod test
2025-12-12 13:51:19 +02:00
Antoni Zawodny
7577f84e79 Skip last victim in async preemption if any prior Pod preemption failed 2025-12-10 14:44:06 +01:00
Ania Borowiec
0cf3d0e20a
Update comment to reflect the current state of the code 2025-11-27 22:10:02 +00:00
Mohammad Varmazyar
4c2fff1934 Address comments, log level, test assertion consistency and remove unnecessary locks in TestFlushUnschedulablePodsLeftoverSetsFlag 2025-11-26 14:08:05 +01:00
Mohammad Varmazyar
4f455c9c0d Refactor plugin clearing to use ClearRejectorPlugins method 2025-11-26 09:54:32 +01:00
Mohammad Varmazyar
bc632c72d0 scheduler: add metric for pods scheduled after flush
Add counter metric to track pods that schedule immediately after
being flushed from unschedulablePods due to timeout. Uses a boolean
flag that is cleared when pods return to queue or move via events.
2025-11-24 09:38:41 +01:00
Mohammad Varmazyar
b2a399cf30 scheduler: add metric for pods scheduled after flush
This metric tracks pods that successfully schedule after being
flushed from unschedulablePods due to timeout. High values may
indicate missing queue hint optimizations or event handling issues.
2025-11-24 09:38:40 +01:00
Ravi Sastry Kadali
9dc5683c56 scheduler: Fix memory leak in scheduler cache
The `removeSlice` function was leaving behind references to the
removed element, preventing it from being garbage-collected.
This commit ensures that removed entries are fully cleared,
eliminating the memory leak.

Co-authored-by: ravisastryk <ravisastryk@gmail.com>
Signed-off-by: Sujal Shah <sujalshah28092004@gmail.com>
2025-11-20 02:18:38 +05:30
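The kind of leak described in commit 9dc5683c56 is a common Go pitfall: shrinking a slice of pointers does not clear the slots beyond the new length, so the backing array keeps the removed element reachable. The sketch below illustrates the pattern under assumptions; `item`, `removeLeaky`, and `removeCleared` are hypothetical names and do not reproduce the actual `removeSlice` code in the scheduler cache.

```go
package main

import "fmt"

// item is a hypothetical element type; the real cache stores other data.
type item struct{ name string }

// removeLeaky drops index i but leaves the vacated tail slot of the
// backing array untouched, so a stale pointer survives there.
func removeLeaky(s []*item, i int) []*item {
	return append(s[:i], s[i+1:]...)
}

// removeCleared shifts the tail left and then sets the vacated slot to nil,
// so the garbage collector can reclaim the removed element.
func removeCleared(s []*item, i int) []*item {
	copy(s[i:], s[i+1:])
	s[len(s)-1] = nil
	return s[:len(s)-1]
}

func main() {
	a := []*item{{name: "a"}, {name: "b"}, {name: "c"}}
	out := removeLeaky(a, 2)
	// a still views the full backing array; slot 2 still points at "c".
	fmt.Println(len(out), a[2] != nil) // 2 true

	b := []*item{{name: "a"}, {name: "b"}, {name: "c"}}
	out = removeCleared(b, 2)
	fmt.Println(len(out), b[2] == nil) // 2 true
}
```

Since Go 1.22 the standard library's `slices.Delete` also zeroes the elements between the new and the old length, which may be an alternative to a hand-written helper.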
bwsalmon
854e67bb51
KEP 5598: Opportunistic Batching (#135231)
* First version of batching w/out signatures.

* First version of pod signatures.

* Integrate batching with signatures.

* Fix merge conflicts.

* Fixes from self-review.

* Test fixes.

* Fix a bug that limited batches to size 2
Also add some new high-level logging and
simplify the pod affinity signature.

* Re-enable batching on perf tests for now.

* fwk.NewStatus(fwk.Success)

* Review feedback.

* Review feedback.

* Comment fix.

* Two plugin-specific unit tests.

* Add cycle state to the sign call, apply to topo spread.
Also add unit tests for several plugin signature
calls.

* Review feedback.

* Switch to distinct stats for hint and store calls.

* Switch signature from string to []byte

* Revert cyclestate in signs. Update node affinity.
Node affinity now sorts all of the various
nested arrays in the structure. CycleState no
longer in signature; revert to signing fewer
cases for pod spread.

* hack/update-vendor.sh

* Disable signatures when extenders are configured.

* Update pkg/scheduler/framework/runtime/batch.go

Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>

* Update staging/src/k8s.io/kube-scheduler/framework/interface.go

Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>

* Review feedback.

* Disable node resource signatures when extended DRA enabled.

* Review feedback.

* Update pkg/scheduler/framework/plugins/imagelocality/image_locality.go

Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>

* Update pkg/scheduler/framework/interface.go

Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>

* Update pkg/scheduler/framework/plugins/nodedeclaredfeatures/nodedeclaredfeatures.go

Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>

* Update pkg/scheduler/framework/runtime/batch.go

Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>

* Review feedback.

* Fixes for review suggestions.

* Add integration tests.

* Linter fixes, test fix.

* Whitespace fix.

* Remove broken test.

* Unschedulable test.

* Remove go.mod changes.

---------

Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
2025-11-12 21:51:37 -08:00
ndixita
5ac2ffcc1e
Enabling NodeDeclaredFeatures in unit tests
Signed-off-by: ndixita <ndixita@google.com>
2025-11-12 08:26:15 +00:00
ndixita
7645eb70e9
Scheduler changes to support pod level resources in place resize 2025-11-11 18:15:22 +00:00