
120 commits

Author SHA1 Message Date
Bartosz
6a8cebaf3b Make sure gated pods are flushed with the same frequency as non-gated 2026-06-19 13:51:17 +00:00
Maciej Skoczeń
54ca619d4b Merge GangScheduling and WorkloadAwarePreemption feature gates into GenericWorkload 2026-06-15 11:42:10 +00:00
Kubernetes Prow Robot
7c7605a853
Merge pull request #139330 from brejman/fix-stuck-preemption-followup
Unset WasFlushedFromUnschedulable for gated pods
2026-06-05 18:33:55 +05:30
Bartosz
3ba1167826
Unset WasFlushedFromUnschedulable for gated pods 2026-06-01 12:22:28 +00:00
Maciej Skoczeń
d952a85cb8 Fix unlikely data race in PriorityQueue.Update logger 2026-05-28 07:29:02 +00:00
Maciej Skoczeń
8eb66b73ef Add support for PodGroups in scheduling queue 2026-05-27 13:06:13 +00:00
Bartosz
27c939be26
Fix case where preemptor may be stuck in unschedulable queue 2026-05-26 13:35:05 +00:00
Kubernetes Prow Robot
6bca553051
Merge pull request #139057 from macsko/store_pod_nomination_before_adding_to_scheduling_queue
Store pod nomination before adding the pod to scheduling queue
2026-05-14 19:44:28 +05:30
Maciej Skoczeń
613ef88f2b Store pod nomination before adding the pod to scheduling queue 2026-05-14 12:28:51 +00:00
Maciej Wyrzuc
83ba364c7f Avoid unnecessary locking in NominatedPodsForNode 2026-05-14 10:58:36 +00:00
Kubernetes Prow Robot
a5427793cf
Merge pull request #135756 from vshkrabkov/refactor/scheduler-moveToActiveQ-remove-underlock
Remove underlock logic from scheduling queue moveToActiveQ method
2026-05-13 09:18:33 +05:30
Kubernetes Prow Robot
de239f13ef
Merge pull request #135160 from KunWuLuan/feat/multi-cond-apicall
support multi conditions in apicall
2026-05-13 09:18:26 +05:30
Kubernetes Prow Robot
dfbe9362f9
Merge pull request #138482 from vshkrabkov/bug/usched-pods-metric
decrease metrics for unscheduled plugin when removing pod from active or backoff queue
2026-05-08 22:15:18 +05:30
vshkrabkov
ce26010a71 decrease metrics for unscheduled plugin when removing pod from active or backoff queue 2026-05-08 14:52:26 +00:00
Jarosław Dzikowski
903f5c308b Remove automatically adding UpdateNodeTaint event to Add node event 2026-04-30 10:22:29 +00:00
Jarosław Dzikowski
50f08420d3 Graduate SchedulerQueueingHints feature gate 2026-04-23 10:17:33 +00:00
vshkrabkov
a4819d7586 add concurrent test scenario 2026-04-23 09:57:29 +00:00
vshkrabkov
41bb825456 addressing review comments 2026-04-22 14:43:10 +00:00
vshkrabkov
7d67954caa Remove add() and has() methods from unlockedActiveQ 2026-04-22 14:43:08 +00:00
Vlad Shkrabkov
afcd3fd899 Refactor PriorityQueue.Update to remove underLock and redundant checks. Removes unnecessary existence checks (activeQ.has, backoffQ.has, unschedulablePods.get) and replaces the activeQ.underLock closure with a self-locking activeQ.add call. 2026-04-22 14:41:23 +00:00
iomarsayed
7a54834917 split pod resource types to help plugins register to only cluster events which they require 2026-04-17 08:29:24 +00:00
Maciej Skoczeń
0e8d10fe48 Handle gated pods correctly while popping PodGroup members 2026-03-17 10:56:38 +00:00
Maciej Skoczeń
38f536c713 Use CycleState.IsPodGroupSchedulingCycle instead of NeedsPodGroupScheduling for pods 2026-03-17 09:10:52 +00:00
Roman Baron
6fcb95e72e scheduler: Moved TestQueuedPodInfo_UpdateInvalidatesSignature from queue/scheduling_queue_test.go to framework/types_test.go 2026-03-12 21:24:23 +02:00
Roman Baron
1e0545b1fa scheduler: Removed TestPriorityQueue_SignatureReuse 2026-03-12 20:52:51 +02:00
Roman Baron
e436c1c812 scheduler: Added a test that validates that adding pod to a queue calculates pod signature and added more cases when pod is updated. Specifically when pod is in backoffQ and unschedulablePods or does not exist in any queue 2026-03-12 20:43:14 +02:00
Roman Baron
c0e973dc70 scheduler: Replaced context.Context and testing.T parameters with ktesting.TContext in scheduling_queue_test.go 2026-03-12 17:31:11 +02:00
Roman Baron
863c68108c
Update pkg/scheduler/backend/queue/scheduling_queue.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
2026-03-12 15:47:24 +02:00
Roman Baron
7b00255135 scheduler: Removed plugin stats from pod signing process 2026-03-12 10:31:04 +02:00
Roman Baron
64db71a727 scheduler: Removed logger parameter from functions that also receive context in scheduling_queue_test.go 2026-03-12 10:30:38 +02:00
Roman Baron
67cc4e5412 scheduler: Removed logger from functions that also receive context 2026-03-12 10:30:38 +02:00
Roman Baron
2904e7f309 scheduler: replaced logger with HandleErrorWithLogger 2026-03-12 10:30:38 +02:00
Roman Baron
2a394caa07 scheduler: Added isOpportunisticBatchingEnabled in scheduling_queue.go 2026-03-12 10:30:38 +02:00
Roman Baron
f032cb548a scheduler: Added logger parameter to signPod in scheduling_queue.go 2026-03-12 10:30:38 +02:00
Roman Baron
85331e916f Update pkg/scheduler/backend/queue/scheduling_queue.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
2026-03-12 10:30:38 +02:00
Roman Baron
858040f604 scheduler: Added eager pod signature calculation 2026-03-12 10:30:38 +02:00
Roman Baron
3a6c169034 scheduler: Reuse scheduling signature for opportunistic batching 2026-03-12 10:30:32 +02:00
Antoni Zawodny
3f094dc228
Create Workload API v1alpha2 (#136976)
* Drop WorkloadRef field and introduce SchedulingGroup field in Pod API

* Introduce v1alpha2 Workload and PodGroup APIs, drop v1alpha1 Workload API

Co-authored-by: yongruilin <yongrlin@outlook.com>

* Run hack/update-codegen.sh

* Adjust kube-scheduler code and integration tests to v1alpha2 API

* Drop v1alpha1 scheduling API group and run make update

---------

Co-authored-by: yongruilin <yongrlin@outlook.com>
2026-03-10 07:59:10 +05:30
Patrick Ohly
b895ce734f golangci-lint: bump to logtools v0.10.1
This fixes a bug that caused log calls involving `klog.Logger` to not be
checked.

As a result we have to fix some code that is now considered faulty:

    ERROR: pkg/controller/serviceaccount/tokens_controller.go:382:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
    ERROR: func (e *TokensController) generateTokenIfNeeded(ctx context.Context, logger klog.Logger, serviceAccount *v1.ServiceAccount, cachedSecret *v1.Secret) ( /* retry */ bool, error) {
    ERROR: ^
    ERROR: pkg/controller/storageversionmigrator/storageversionmigrator.go:299:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
    ERROR: func (svmc *SVMController) runMigration(ctx context.Context, logger klog.Logger, gvr schema.GroupVersionResource, resourceMonitor *garbagecollector.Monitor, toBeProcessedSVM *svmv1beta1.StorageVersionMigration, listResourceVersion string) (err error, failed bool) {
    ERROR: ^
    ERROR: pkg/proxy/node.go:121:3: logging function "Error" should not use format specifier "%q" (logcheck)
    ERROR: 		klog.FromContext(ctx).Error(nil, "Timed out waiting for node %q to exist", nodeName)
    ERROR: 		^
    ERROR: pkg/proxy/node.go:123:3: logging function "Error" should not use format specifier "%q" (logcheck)
    ERROR: 		klog.FromContext(ctx).Error(nil, "Timed out waiting for node %q to be assigned IPs", nodeName)
    ERROR: 		^
    ERROR: pkg/scheduler/backend/queue/scheduling_queue.go:610:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
    ERROR: func (p *PriorityQueue) runPreEnqueuePlugin(ctx context.Context, logger klog.Logger, pl fwk.PreEnqueuePlugin, pInfo *framework.QueuedPodInfo, shouldRecordMetric bool) *fwk.Status {
    ERROR: ^
    ERROR: pkg/scheduler/framework/plugins/dynamicresources/extendeddynamicresources.go:286:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
    ERROR: func (pl *DynamicResources) deleteClaim(ctx context.Context, claim *resourceapi.ResourceClaim, logger klog.Logger) error {
    ERROR: ^
    ERROR: pkg/scheduler/framework/plugins/dynamicresources/extendeddynamicresources.go:499:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
    ERROR: func (pl *DynamicResources) waitForExtendedClaimInAssumeCache(
    ERROR: ^
    ERROR: pkg/scheduler/framework/plugins/dynamicresources/extendeddynamicresources.go:528:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
    ERROR: func (pl *DynamicResources) createExtendedResourceClaimInAPI(
    ERROR: ^
    ERROR: pkg/scheduler/framework/plugins/dynamicresources/extendeddynamicresources.go:592:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
    ERROR: func (pl *DynamicResources) unreserveExtendedResourceClaim(ctx context.Context, logger klog.Logger, pod *v1.Pod, state *stateData) {
    ERROR: ^
    ERROR: pkg/scheduler/framework/runtime/batch.go:171:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
    ERROR: func (b *OpportunisticBatch) batchStateCompatible(ctx context.Context, logger klog.Logger, pod *v1.Pod, signature fwk.PodSignature, cycleCount int64, state fwk.CycleState, nodeInfos fwk.NodeInfoLister) bool {
    ERROR: ^
    ERROR: staging/src/k8s.io/component-base/featuregate/feature_gate.go:890:4: Additional arguments to Info should always be Key Value pairs. Please check if there is any key or value missing. (logcheck)
    ERROR: 			logger.Info("Warning: SetEmulationVersionAndMinCompatibilityVersion will change already queried feature", "featureGate", feature, "oldValue", oldVal, newVal)
    ERROR: 			^
    ERROR: test/images/sample-device-plugin/sampledeviceplugin.go:108:2: logging function "Info" should not use format specifier "%s" (logcheck)
    ERROR: 	logger.Info("pluginSocksDir: %s", pluginSocksDir)
    ERROR: 	^
    ERROR: test/images/sample-device-plugin/sampledeviceplugin.go:123:2: logging function "Info" should not use format specifier "%s" (logcheck)
    ERROR: 	logger.Info("CDI_ENABLED: %s", cdiEnabled)
    ERROR: 	^

While waiting for this to merge, another call was added which also doesn't
follow conventions:

    ERROR: pkg/kubelet/kubelet.go:2454:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
    ERROR: func (kl *Kubelet) deletePod(ctx context.Context, logger klog.Logger, pod *v1.Pod) error {
    ERROR: ^

Contextual logging has been beta and enabled by default for several releases
now. It's mostly just a matter of wrapping up and declaring it GA. Therefore
the calls which directly call WithName or WithValues (which always have an
effect) are left as-is instead of converting them to use the klog wrappers
(which support disabling the effect). To allow that, the linter gets
reconfigured to not complain about this anymore, anywhere.

The calls which would have to be fixed otherwise are:

    ERROR: pkg/kubelet/cm/dra/claiminfo.go:170:11: function "WithName" should be called through klogr.LoggerWithName (logcheck)
    ERROR: 	logger = logger.WithName("dra-claiminfo")
    ERROR: 	         ^
    ERROR: pkg/kubelet/cm/dra/healthinfo.go:45:11: function "WithName" should be called through klogr.LoggerWithName (logcheck)
    ERROR: 	logger = logger.WithName("dra-healthinfo")
    ERROR: 	         ^
    ERROR: pkg/kubelet/cm/dra/healthinfo.go:89:11: function "WithName" should be called through klogr.LoggerWithName (logcheck)
    ERROR: 	logger = logger.WithName("dra-healthinfo")
    ERROR: 	         ^
    ERROR: pkg/kubelet/cm/dra/healthinfo.go:157:11: function "WithName" should be called through klogr.LoggerWithName (logcheck)
    ERROR: 	logger = logger.WithName("dra-healthinfo")
    ERROR: 	         ^
    ERROR: pkg/kubelet/cm/dra/manager.go:175:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
    ERROR: 	logger := klog.FromContext(ctx).WithName("dra-manager")
    ERROR: 	          ^
    ERROR: pkg/kubelet/cm/dra/manager.go:239:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
    ERROR: 	logger := klog.FromContext(ctx).WithName("dra-manager")
    ERROR: 	          ^
    ERROR: pkg/kubelet/cm/dra/manager.go:593:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
    ERROR: 	logger := klog.FromContext(ctx).WithName("dra-manager")
    ERROR: 	          ^
    ERROR: pkg/kubelet/cm/dra/manager.go:781:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
    ERROR: 	logger := klog.FromContext(context.Background()).WithName("dra-manager")
    ERROR: 	          ^
    ERROR: pkg/kubelet/cm/dra/manager.go:898:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
    ERROR: 	logger := klog.FromContext(ctx).WithName("dra-manager")
    ERROR: 	          ^
    ERROR: pkg/kubelet/cm/dra/manager_test.go:1638:15: function "WithName" should be called through klogr.LoggerWithName (logcheck)
    ERROR: 				logger := klog.FromContext(streamCtx).WithName(st.Name())
    ERROR: 				          ^
    ERROR: pkg/kubelet/cm/dra/plugin/dra_plugin.go:77:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
    ERROR: 	logger := klog.FromContext(ctx).WithName("dra-plugin")
    ERROR: 	          ^
    ERROR: pkg/kubelet/cm/dra/plugin/dra_plugin.go:108:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
    ERROR: 	logger := klog.FromContext(ctx).WithName("dra-plugin")
    ERROR: 	          ^
    ERROR: pkg/kubelet/cm/dra/plugin/dra_plugin.go:161:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
    ERROR: 	logger := klog.FromContext(ctx).WithName("dra-plugin")
    ERROR: 	          ^
    ERROR: staging/src/k8s.io/dynamic-resource-allocation/resourceslice/tracker/tracker.go:695:14: function "WithValues" should be called through klogr.LoggerWithValues (logcheck)
    ERROR: 			logger := logger.WithValues("device", deviceID)
    ERROR: 			          ^
    ERROR: test/integration/apiserver/watchcache_test.go:42:54: function "WithName" should be called through klogr.LoggerWithName (logcheck)
    ERROR: 	etcd0URL, stopEtcd0, err := framework.RunCustomEtcd(klog.FromContext(ctx).WithName("etcd0"), "etcd_watchcache0", etcdArgs)
    ERROR: 	                                                    ^
    ERROR: test/integration/apiserver/watchcache_test.go:47:54: function "WithName" should be called through klogr.LoggerWithName (logcheck)
    ERROR: 	etcd1URL, stopEtcd1, err := framework.RunCustomEtcd(klog.FromContext(ctx).WithName("etcd1"), "etcd_watchcache1", etcdArgs)
    ERROR: 	                                                    ^
    ERROR: test/integration/scheduler_perf/scheduler_perf.go:1149:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
    ERROR: 		logger = logger.WithName(tCtx.Name())
    ERROR: 		         ^
2026-03-04 12:08:18 +01:00
Kubernetes Prow Robot
f9c9f03b05
Merge pull request #136618 from macsko/workload_scheduling_cycle
KEP-4671: Introduce Workload Scheduling Cycle
2026-02-17 15:21:04 +05:30
Maciej Skoczeń
6233b25907 Introduce Workload Scheduling Cycle
Add integration tests for gang and basic policy workload scheduling

Add more tests for cluster snapshot

Proceed to binding cycle just after pod group cycle

Enforce one scheduler name per pod group, rename workload cycle to pod group cycle

Add unit tests for pod group scheduling cycle

Run ScheduleOne tests treating pod as part of a pod group

Rename NeedsPodGroupCycle to NeedsPodGroupScheduling

Observe correct per-pod and per-podgroup metrics during pod group cycle

Rename pod group algorithm status to waiting_on_preemption

Mention forgotAllAssumedPods is a safety check
2026-02-17 09:02:32 +00:00
Vlad Shkrabkov
cbfc201afa Remove nil checks for unschedulable pods metrics recorder 2026-02-06 11:03:37 +00:00
Maciej Skoczeń
01540ed1d5 Cleanup scheduling queue tests to use cmp.Diff instead of direct pod comparisons 2026-01-29 13:08:00 +00:00
vshkrabkov
b78cdbfdf4 Adds test cases for multiple preEnqueue plugins 2026-01-09 15:35:48 +00:00
vshkrabkov
779ff43005 Add unschedulable pods metric drop for pod deletion 2026-01-07 15:17:27 +00:00
KunWuLuan
e37d012bce
support multi conditions in apicall
Signed-off-by: KunWuLuan <kunwuluan@gmail.com>
2026-01-05 16:51:34 +08:00
Manthan Parmar
41cde37f00 Update pkg/scheduler/backend/queue/scheduling_queue.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
2025-12-30 15:05:51 +00:00
Manuel Grandeit
66d4bd3206 Fix data race in PriorityQueue.UnschedulablePods()
The UnschedulablePods() function iterates over the unschedulablePods.podInfoMap
without holding any lock, while other goroutines may concurrently modify the map
via addOrUpdate(), delete(), or clear().

Other functions like PendingPods() and GetPod() correctly acquire p.lock.RLock()
before accessing unschedulablePods.podInfoMap, but UnschedulablePods() was
missing this.

Fix by adding p.lock.RLock()/RUnlock() to UnschedulablePods(), matching the
pattern used by PendingPods().
2025-12-20 13:46:58 +01:00
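
The locking pattern described in commit 66d4bd3206 can be sketched as follows. This is a simplified illustration, not the scheduler's code: podQueue and its string-valued podInfoMap are hypothetical stand-ins for PriorityQueue and its pod info map, and only the use of a read lock around the map iteration mirrors the commit message.

```go
package main

import (
	"fmt"
	"sync"
)

// podQueue is a hypothetical, simplified stand-in for PriorityQueue.
type podQueue struct {
	lock       sync.RWMutex
	podInfoMap map[string]string // pod key -> pod name (simplified)
}

// addOrUpdate mutates the map while holding the write lock.
func (q *podQueue) addOrUpdate(key, name string) {
	q.lock.Lock()
	defer q.lock.Unlock()
	q.podInfoMap[key] = name
}

// unschedulablePods acquires the read lock before iterating, so a
// concurrent addOrUpdate cannot modify the map during the iteration.
func (q *podQueue) unschedulablePods() []string {
	q.lock.RLock()
	defer q.lock.RUnlock()
	result := make([]string, 0, len(q.podInfoMap))
	for _, name := range q.podInfoMap {
		result = append(result, name)
	}
	return result
}

func main() {
	q := &podQueue{podInfoMap: map[string]string{}}

	// Writers and readers run concurrently; without RLock in
	// unschedulablePods this would be a data race on the map.
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			q.addOrUpdate(fmt.Sprintf("ns/pod-%d", i), fmt.Sprintf("pod-%d", i))
			_ = q.unschedulablePods()
		}(i)
	}
	wg.Wait()

	fmt.Println(len(q.unschedulablePods())) // prints 50
}
```

Running a variant without the RLock/RUnlock pair under `go run -race` is one way to observe the kind of race the commit describes.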
Kubernetes Prow Robot
1757c6358b
Merge pull request #135368 from vshkrabkov/fix/scheduler-queue-metric-sync
Scheduler: Fix GatedPods metric desync in unschedulable queue
2025-12-17 21:42:00 -08:00
Vlad Shkrabkov
5be527b78e Scheduler: Fix GatedPods metric desync in unschedulable queue
Previously, when a Pod residing in the 'unschedulablePods' queue was updated and subsequently rejected by PreEnqueue plugins (returning 'Wait'), the logic in 'moveToActiveQ' would return early because the Pod was already present in the queue.

This caused the 'scheduler_gated_pods_total' metric to fail to increment, leading to metric inconsistencies (and potentially negative values upon Pod deletion).

This change adds a check to detect the transition from Ungated to Gated. If detected, the Pod is removed and re-added to the queue to ensure metrics are correctly swapped (Unschedulable-- and Gated++).

Added regression test 'TestSchedulingQueueMetrics_UngatedToGated' to verify the fix.

Signed-off-by: Vlad Shkrabkov <vshkrabkov@google.com>
2025-12-15 11:47:22 +00:00
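
The metric swap described in commit 5be527b78e (Unschedulable-- and Gated++ on an Ungated to Gated transition) can be sketched like this. It is a minimal model under stated assumptions: unschedulableQueue, podEntry, markGated and the two integer counters are hypothetical and stand in for the real queue and its metric recorders; only the remove-and-re-add idea is taken from the commit message.

```go
package main

import "fmt"

// podEntry records whether a queued pod is currently gated.
type podEntry struct{ gated bool }

// unschedulableQueue is a hypothetical model of the unschedulable pods
// queue with two gauges that must stay consistent with its contents.
type unschedulableQueue struct {
	pods          map[string]*podEntry
	unschedulable int // models the unschedulable pods gauge
	gated         int // models the gated pods gauge
}

// add inserts a pod and increments the gauge matching its state.
func (q *unschedulableQueue) add(key string, gated bool) {
	q.pods[key] = &podEntry{gated: gated}
	if gated {
		q.gated++
	} else {
		q.unschedulable++
	}
}

// remove deletes a pod and decrements the gauge it was counted under.
func (q *unschedulableQueue) remove(key string) {
	e, ok := q.pods[key]
	if !ok {
		return
	}
	if e.gated {
		q.gated--
	} else {
		q.unschedulable--
	}
	delete(q.pods, key)
}

// markGated models a PreEnqueue rejection. Instead of returning early
// because the pod is already queued, it detects the Ungated -> Gated
// transition and re-adds the pod so both gauges are swapped.
func (q *unschedulableQueue) markGated(key string) {
	e, ok := q.pods[key]
	if ok && e.gated {
		return // already counted as gated, nothing to swap
	}
	if ok {
		q.remove(key) // Unschedulable--
	}
	q.add(key, true) // Gated++
}

func main() {
	q := &unschedulableQueue{pods: map[string]*podEntry{}}
	q.add("default/web", false)
	q.markGated("default/web")
	fmt.Println(q.unschedulable, q.gated) // prints 0 1

	q.remove("default/web")
	fmt.Println(q.unschedulable, q.gated) // prints 0 0
}
```

In this model, skipping the swap would leave the pod counted as unschedulable while later being removed as gated, which is how a gauge could drift negative on deletion.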