5875 commits

Author SHA1 Message Date
Richa Banker
7ffcacbf9e Wire informer name through kube-controller-manager shared informers 2026-02-05 11:45:47 -08:00
Richa Banker
864357774f Add FIFO queue depth metrics 2026-02-05 11:42:20 -08:00
Kubernetes Prow Robot
6dfae1df46
Merge pull request #136551 from kannon92/fix-pod-replacement-policy-review-nits
Use Log instead of Logf for job integration tests
2026-01-29 00:27:50 +05:30
Kubernetes Prow Robot
ab78ad32d1
Merge pull request #136194 from bart0sh/PR216-add-signode-approvers-to-dra-owners
add SIG-Node approvers to DRA dirs
2026-01-27 20:15:52 +05:30
Kevin Hannon
da76c98b4d Use Log instead of Logf for job integration where we don't have any variadic arguments 2026-01-26 16:39:00 -05:00
Kubernetes Prow Robot
53b29a3a2c
Merge pull request #136269 from pohly/dra-scheduler-double-allocation-fixes
DRA scheduler: double allocation fixes
2026-01-26 20:59:50 +05:30
Patrick Ohly
001ec49eb6 DRA integration: more pods per node, more parallelism
Long running tests like TestDRA/all/DeviceBindingConditions (42.50s)
should run in parallel with other tests, otherwise the overall runtime is too
high.

This then must allow more pods per node to avoid blocking scheduling.
2026-01-26 15:44:49 +01:00
Patrick Ohly
2198d96520 DRA integration: add "uses all resources" test
This corresponds to an E2E test which sometimes (but very rarely) flaked in the
CI.
2026-01-26 15:44:48 +01:00
Mads Jensen
757647786d Remove redundant re-assignments in for-loops in test/{e2e,integration,utils}
The modernize forvar rule was applied. There are more details in this blog
post: https://go.dev/blog/loopvar-preview
2026-01-25 22:58:27 +01:00
Kubernetes Prow Robot
5f4adaf579
Merge pull request #136303 from ShaanveerS/fix-flake
scheduler: deflake TestUnReservePreBindPlugins
2026-01-23 05:17:27 +05:30
ShaanveerS
42b16a8dd8 scheduler: deflake TestUnReservePreBindPlugins 2026-01-22 12:27:45 +01:00
Kubernetes Prow Robot
6fde485ec9
Merge pull request #135309 from richabanker/zpages
Enhance content negotiation for zpages
2026-01-22 03:17:25 +05:30
Patrick Ohly
fda3bdbd5e DRA tests: stop using deprecated ktesting functions
Some of them were already converted previously, but didn't take full advantage
of the more flexible methods: errors can be checked again by Gomega.
2026-01-19 08:27:15 +01:00
Kubernetes Prow Robot
8de4a11252
Merge pull request #136156 from pohly/dra-upgrade-downgrade-refactor-2
DRA: upgrade/downgrade refactor, II
2026-01-16 23:31:15 +05:30
Kubernetes Prow Robot
08764697f4
Merge pull request #135381 from kannon92/mutable-pod-replacement-policy
[KEP-5440]: Add integration test for MutablePodResourcesForSuspendedJobs with Pod Replacement Policy = Failed
2026-01-16 19:29:16 +05:30
Patrick Ohly
1847d5b1a2 DRA e2e+integration: test ResourceSlice controller
The "create 100 slices" E2E sometimes flaked with timeouts (e.g. 95 out of 100
slices created). It created too much load for an E2E test.

The same test now uses ktesting as API, which makes it possible to run it as
integration test with the original 100 slices and with more moderate 10 slices
as E2E test.

(cherry picked from commit c47ad64820)
2026-01-16 08:10:37 +01:00
Kubernetes Prow Robot
0ba578f91f
Merge pull request #135393 from tosi3k/parallel-prebind
Run PreBind plugins in parallel
2026-01-15 12:39:34 +05:30
Ed Bartosh
d966d9b89d scheduler_perf: use -benchtime=1x in the test examples
Update scheduler performance test examples to use `-benchtime=1x`
instead of `-benchtime=1ns` for explicitly running each benchmark
exactly once. This makes the intent clearer and aligns the examples
with recommended Go benchmark usage.

Co-authored-by: Patrick Ohly <patrick.ohly@intel.com>
2026-01-14 11:07:32 +02:00
Ed Bartosh
5c77f56ae6 add SIG-Node approvers to DRA dirs 2026-01-13 11:52:29 +02:00
Kubernetes Prow Robot
d68d48073f
Merge pull request #136112 from danwinship/network-1.36-cleanup
Drop TopologyAwareHints and ServiceTrafficDistribution feature gates
2026-01-13 07:43:36 +05:30
Karthik Bhat
8962f08815 Remove deprecated test methods 2026-01-12 16:15:04 +05:30
Antoni Zawodny
833b7205fc Run PreBind plugins in parallel if feasible 2026-01-11 14:19:18 +01:00
Patrick Ohly
e999d595b1 testing: partial revert of E2E + DRA upgrade/downgrade
Refactoring the DRA upgrade/downgrade testing such that it runs as Go test
depended on supporting ktesting in the E2E framework. That change worked during
presubmit testing, but broke some periodic jobs. Therefore the relevant commits
from https://github.com/kubernetes/kubernetes/pull/135664/commits get reverted:

c47ad64820 DRA e2e+integration: test ResourceSlice controller
047682908d ktesting: replace Begin/End with TContext.Step
de47714879 DRA upgrade/downgrade: rewrite as Go unit test
7c7b1e1018 DRA e2e: make driver deployment possible in Go unit tests
65ef31973c DRA upgrade/downgrade: split out individual test steps
47b613eded e2e framework: support creating TContext

The last one is what must have caused the problem, but the other commits depend
on it.
2026-01-11 09:55:17 +01:00
Dan Winship
f278b47ecd Drop TopologyAwareHints and ServiceTrafficDistribution feature gates 2026-01-09 12:42:34 -05:00
Kubernetes Prow Robot
407b1de3bf
Merge pull request #136076 from kannon92/fix-flake-mutable-job
[flake] wait for job suspended condition for JobMutable test cases
2026-01-09 16:39:39 +05:30
Kevin Hannon
2a9c44b329 wait for job suspended condition 2026-01-08 15:57:00 -05:00
Kubernetes Prow Robot
e551ea5ea5
Merge pull request #133678 from mortent/AllocatorPerfImprovements
DRA: Avoid unnecessary work in allocator
2026-01-09 01:19:41 +05:30
Morten Torkildsen
9562aa8ba5 DRA: Avoid unnecessary work in allocator 2026-01-08 16:52:44 +00:00
Patrick Ohly
c47ad64820 DRA e2e+integration: test ResourceSlice controller
The "create 100 slices" E2E sometimes flaked with timeouts (e.g. 95 out of 100
slices created). It created too much load for an E2E test.

The same test now uses ktesting as API, which makes it possible to run it as
integration test with the original 100 slices and with more moderate 10 slices
as E2E test.
2026-01-07 14:11:33 +01:00
Patrick Ohly
551cf6f171 ktesting: reimplement without interface
The original implementation was inspired by how context.Context is handled via
wrapping a parent context. That approach had several issues:

- It is useful to let users call methods (e.g. tCtx.ExpectNoError)
  instead of ktesting functions with a tCtx parameter, but that only
  worked if all implementations of the interface implemented that
  set of methods. This made extending those methods cumbersome (see
  the commit which added Require+Assert) and could potentially break
  implementations of the interface elsewhere, defeating part of the
  motivation for having the interface in the first place.

- It was hard to see how the different TContext wrappers cooperated
  with each other.

- Layering injection of "ERROR" and "FATAL ERROR" on top of prefixing
  with the klog header caused post-processing of a failed unit test to
  remove that line because it looked like log output. Other log output
  lines were kept because they were not indented.

- In Go <=1.25, the `go vet sprintf` check only works for functions and
  methods if they get called directly and themselves directly pass their
  parameters on to fmt.Sprint. The check does not work when calling
  methods through an interface. Support for that is coming in Go 1.26,
  but will depend on bumping the Go version also in go.mod and thus
  may not be immediately possible in Kubernetes.

- Interface documentation in
  https://pkg.go.dev/k8s.io/kubernetes@v1.34.2/test/utils/ktesting#TContext
  is a monolithic text block. Documentation for methods is more readable and allows
  referencing those methods with [] (e.g. [TC.Errorf] works, [TContext.Errorf]
  didn't).

The revised implementation is a single struct with (almost) no exported
fields. The two exceptions are the embedded context.Context, which is useful
because it avoids having to write wrappers for several functions, and the
embedded TB, which is necessary because Helper cannot be wrapped. Like a
logr.LogSink, With* methods can make a shallow copy and then change some
fields in the cloned instance.

The former `ktesting.TContext` interface is now a type alias for
`*ktesting.TC`. This ensures that existing code using ktesting doesn't need to
be updated, and it keeps that code a bit more compact (`tCtx
ktesting.TContext` instead of `tCtx *ktesting.TContext` when not using such an
alias). Hiding that it is a pointer might discourage accessing the exported
fields because it looks like an interface.

Output gets fixed and improved such that:
- "FATAL ERROR" and "ERROR" are at the start of the line, followed by the klog header.
- The failure message follows in the next line.
- Continuation lines are always indented.

The set of methods exposed via TB is now a bit more complete (Attr, Chdir).

All former stand-alone With* functions are now also available as methods and
should be used instead of the functions. Those will be removed.

Linting of log calls now works and found some issues.
2026-01-05 13:45:03 +01:00
Kubernetes Prow Robot
8d1296caf2
Merge pull request #135912 from pohly/scheduler-plugin-test-data-race
scheduler: plugin test DATA RACE fix
2025-12-29 14:46:35 +05:30
Kubernetes Prow Robot
ed4b5ee317
Merge pull request #134350 from macsko/add_scheduling_duration_collector
Add scheduling duration collector to scheduler_perf
2025-12-28 05:50:33 +05:30
Patrick Ohly
f758d0850b scheduler: plugin test DATA RACE fix
Reading numPreFilterCalled races with writing it in the scheduler, at least as
far as the data race detector is concerned. That the test waits for pod
scheduling is too indirect. enqueuePlugin.called has the same problem,
but hasn't triggered the race detector (yet).

We need to protect against concurrent access. The easiest way to enforce that
is via atomic.Int64. In contrast to a mutex it is impossible to use it wrong.

Shutting down the scheduler first was also tried, but didn't work out because
"teardown" does more than just stopping the scheduler, it also cancels a
context that is needed during test shutdown.
2025-12-23 19:13:53 +01:00
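A minimal sketch of the atomic.Int64 approach named in this commit message follows. `countingPlugin` and `runConcurrently` are hypothetical stand-ins for the test plugin and the scheduler calling it; only the field name `numPreFilterCalled` is taken from the message.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// countingPlugin is a hypothetical test plugin. Using atomic.Int64 means
// every read and write goes through Load/Add, so there is no unprotected
// access for the race detector to flag.
type countingPlugin struct {
	numPreFilterCalled atomic.Int64
}

// PreFilter records one call. It may be invoked from any goroutine.
func (p *countingPlugin) PreFilter() {
	p.numPreFilterCalled.Add(1)
}

// runConcurrently calls PreFilter from n goroutines and returns the counter
// as read by the caller after all of them have finished.
func runConcurrently(p *countingPlugin, n int) int64 {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			p.PreFilter()
		}()
	}
	wg.Wait()
	return p.numPreFilterCalled.Load()
}

func main() {
	p := &countingPlugin{}
	fmt.Println(runConcurrently(p, 100))
}
```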
Kubernetes Prow Robot
b9d491f56e
Merge pull request #134556 from carlory/fix-133160
lock the feature-gate VolumeAttributesClass to default (true)
2025-12-18 15:13:17 -08:00
Patrick Ohly
ad79e479c2 build: remove deprecated '// +build' tag
This has been replaced by `//go:build ...` for a long time now.

Removal of the old build tag was automated with:

    for i in $(git grep -l '^// +build' | grep -v -e '^vendor/'); do if ! grep -q '^// Code generated' "$i"; then sed -i -e '/^\/\/ +build/d' "$i"; fi; done
2025-12-18 12:16:21 +01:00
carlory
f8e8e55f1d
locked the feature-gate VolumeAttributesClass to default (true) and switch storage version from v1beta1 to v1
Signed-off-by: carlory <baofa.fan@daocloud.io>
2025-12-18 15:59:33 +08:00
Kubernetes Prow Robot
d9c281159a
Merge pull request #135494 from Argh4k/readme-fix
Fix example with profiling in README
2025-12-17 22:36:21 -08:00
Kubernetes Prow Robot
43cfcac7cc
Merge pull request #135434 from yliaog/quota_abuse
Fixes the loophole that allows users to workaround resource quota set by system admin
2025-12-17 22:35:28 -08:00
Kubernetes Prow Robot
a2a97119bb
Merge pull request #135361 from Karthik-K-N/cel-test-imporvements
CEL test improvements: use the test context across tests instead of a generic context
2025-12-17 21:41:45 -08:00
Kubernetes Prow Robot
fefd7ddc37
Merge pull request #135348 from brejman/issue-134393-perf
Add perf test for scheduling pods matching existing pods antiaffinity
2025-12-17 21:41:29 -08:00
Kubernetes Prow Robot
285eb9fdba
Merge pull request #135325 from brejman/issue-134393
Fix queue hint for inter-pod anti-affinity
2025-12-17 20:01:02 -08:00
Kubernetes Prow Robot
f9761d1319
Merge pull request #135301 from bwsalmon/bsalmon-batch-after
Fix a bug in scheduler_perf integration test
2025-12-17 20:00:39 -08:00
yliao
3e34de29c4 fixed the loophole that allows user to get around resource quota set by system admin 2025-12-18 00:56:20 +00:00
Richa Banker
e179f38cb8 zpages - add proper handling of the application/yaml Accept Header 2025-12-17 15:57:29 -08:00
Bartosz
49035d1404
Add perf test for scheduling pods matching existing pods antiaffinity 2025-12-16 13:02:11 +00:00
Bartosz
d6d8639349
Fix queue hint for interpod antiaffinity 2025-12-16 13:01:15 +00:00
Maciej Skoczeń
bfc44a42d5 Allow to change scheduler_perf threshold data bucket 2025-12-15 14:39:56 +00:00
Antonio Ojea
51f614a156 ipallocator: handle errors correctly
The ipallocator blindly assumed that all errors are retryable, which caused
the allocator to exhaust all the possibilities when trying to allocate an IP
address.

If the error is not retryable, the allocator generates as many API calls as
there are available IPs in the allocator, causing CPU exhaustion since these
requests come from inside the apiserver.

In addition to handling the error correctly, this patch also interprets the
error to return the right status code depending on the error type.

Co-authored-by: carlory <baofa.fan@daocloud.io>
2025-12-03 10:39:57 +00:00
Maciej Skoczeń
e22a30a13e Add scheduling duration collector to scheduler_perf 2025-12-02 14:48:22 +00:00
Maciej Wyrzuc
9a8c2a4001 Fix example with profiling in README 2025-12-01 10:44:15 +00:00