Commit graph

3529 commits

Author SHA1 Message Date
Kubernetes Prow Robot
874a7b40b0
Merge pull request #138617 from esotsal/kubeletHealthCheckRefactor
Move kubeletHealthCheck from e2enode to node as HealthCheck
2026-05-12 02:26:10 +05:30
Kubernetes Prow Robot
5cf56a97d5
Merge pull request #138851 from saschagrunert/fix/container-metrics-flake
Fix ContainerMetrics cadvisor test flake for block I/O metrics
2026-05-10 18:37:47 +05:30
Sotiris Salloumis
20c57876a4 Increase bound CPU limit to 2e+10 to fix admission api flaky test.
After the command was replaced to increase UsageNanoCores (to fix a previous flaky test),
UsageNanoCores exceeds the 2e+09 limit in some test environments. This commit
attempts to fix that by increasing the UsageNanoCores limit to 2e+10.
2026-05-09 09:46:23 +02:00
Kubernetes Prow Robot
4818833ecc
Merge pull request #138820 from esotsal/fix-sriov-cpumanager
Fix podresources flaky test: wait for Pod Resources V1 serving in flaky test
2026-05-08 00:05:18 +05:30
Sascha Grunert
ee9f8c6bde
Fix ContainerMetrics cadvisor test flakes
Replace the small echo write with a dd that uses conv=fsync to force
data through the block layer. Without fsync, the 11-byte echo writes
stay in page cache and never reach the block device within the
60-second test window. This leaves the cgroup io.stat empty, so
cadvisor does not emit container_blkio_device_usage_total,
container_fs_reads_bytes_total, or container_fs_writes_bytes_total
for the container.

The conv=fsync call guarantees block device I/O on every loop
iteration. Once io.stat has an entry for a device, all fields
(rbytes, wbytes, rios, wios) are present, even if zero, so all
cadvisor metrics pass their boundedSample(0, ...) checks.

Also increase the UsageCoreNanoSeconds upper bound from 1e11 to 1e12
for the container and pod-level CPU checks. The cumulative CPU time
can exceed 100s on slower architectures like ppc64le where the dd
CPU burner loop accumulates faster than expected.

Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
2026-05-07 15:01:02 +02:00
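Note on ee9f8c6bde: the fix depends on the cgroup io.stat file gaining a per-device entry once real block I/O happens. The following is a minimal Go sketch of how such a file could be parsed, assuming the cgroup v2 layout of one line per device with key=value counters. The function name parseIOStat and the sample values are hypothetical and are not taken from cadvisor or the test itself.

```go
package main

import (
	"strconv"
	"strings"
)

// parseIOStat parses the text of a cgroup v2 io.stat file into
// device ("MAJ:MIN") -> counter name -> value.
// An empty file yields an empty map, which corresponds to the case the
// commit describes: no device entry, so no per-device metrics to emit.
func parseIOStat(contents string) (map[string]map[string]uint64, error) {
	stats := make(map[string]map[string]uint64)
	for _, line := range strings.Split(contents, "\n") {
		fields := strings.Fields(line)
		if len(fields) == 0 {
			continue // skip blank lines
		}
		counters := make(map[string]uint64)
		for _, kv := range fields[1:] {
			key, val, ok := strings.Cut(kv, "=")
			if !ok {
				continue // ignore tokens that are not key=value
			}
			n, err := strconv.ParseUint(val, 10, 64)
			if err != nil {
				return nil, err
			}
			counters[key] = n
		}
		stats[fields[0]] = counters
	}
	return stats, nil
}
```

With an input line such as "8:0 rbytes=0 wbytes=4096 rios=0 wios=1", all four counters are present for device 8:0 even though two of them are zero, which is why a lower bound of zero is enough once any write reaches the device.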
Kubernetes Prow Robot
d92b8fe8f2
Merge pull request #138739 from zxqlxy/device-plugin-slow-register
Add e2e test for device plugin slow register
2026-05-07 11:42:31 +05:30
Sotiris Salloumis
acabaa7d50 Fix podresources flaky test: wait for Pod Resources V1 serving in flaky test
One podresources test was not waiting for Pod Resources V1 to be serving,
which can make a later step of the test flaky.

This change attempts to fix the flake by adding waitForPodResourcesV1Serving(ctx),
as the remaining tests do. In addition, ExpectNoError was added to all connection-closing
attempts to improve troubleshooting.
2026-05-07 05:35:17 +02:00
Xinyun Liu
62e23b9857 Add E2E test for multiple device plugins where the second one struggles to register
2026-05-06 23:48:32 +00:00
Paco Xu
11d08fcb7f
Revert "remove flaky label in SRIOV related tests"
2026-05-06 17:11:33 +08:00
Sotiris Salloumis
5486715fbf Move kubeletHealthCheck from e2enode to node
This reduces code duplication and avoids the "import cycle not allowed"
compile-time error when the health check is used in non-e2e_node packages.
2026-05-05 20:39:07 +02:00
Kubernetes Prow Robot
d2b48c52df
Merge pull request #138716 from lukaszwojciechowski/fix-sriov-teardown
fix: SRIOV resources cleanup in runTMScopeResourceAlignmentTestSuite
2026-05-05 14:16:21 +05:30
Kubernetes Prow Robot
43f4e90bee
Merge pull request #138755 from saschagrunert/fix-crio-conformance-container-metrics
Replace openssl speed CPU burner in summary_test.go
2026-05-05 13:02:29 +05:30
Kubernetes Prow Robot
f8535a28a6
Merge pull request #138462 from shachartal/fix/sidecar-ephemeral-storage-eviction
kubelet: enforce ephemeral-storage limits on restartable init containers
2026-05-04 23:44:23 +05:30
Sascha Grunert
e7bc0479c0
Replace openssl speed CPU burner in summary_test.go
The openssl speed command added in #138423 causes the ContainerMetrics
cadvisor test to fail on CRI-O conformance (ci-node-crio-conformance)
by exceeding upper bounds for container_fs_writes_total,
container_blkio_device_usage_total and container_memory_failures_total.

Replace with a lightweight dd-based CPU burner that generates CPU load
via syscall overhead without filesystem I/O or memory side effects.
Revert the bound changes to pre-#138423 values. The underlying
UsageNanoCores issue is better addressed at the kubelet level
by #138687.

Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
2026-05-04 15:31:18 +02:00
Sotiris Salloumis
c306df361c Fix Summary API resource usage test
Make the process in the container more CPU-intensive so that CPU usage above
a nanocore is caught within the test window, working around
a known limitation in older containerd versions.

Increase the UsageNanoCores and UsageCoreNanoSeconds bounds to accommodate
the additional CPU load.
2026-05-03 18:31:43 +02:00
Kubernetes Prow Robot
10104afcde
Merge pull request #138289 from esotsal/fix-device-plugin-test-flaky
Fix pull-kubernetes-node-kubelet-serial-containerd flaky tests
2026-05-03 05:21:23 +05:30
Kubernetes Prow Robot
bb0bcb8a85
Merge pull request #135437 from zxqlxy/device-plugin-fix
DRA-like fix for device-plugin race condition problem
2026-05-02 10:01:24 +05:30
Xinyun Liu
ce680fea20 Add E2E test for multiple device-plugins scenarios
2026-05-01 21:35:15 +00:00
Kubernetes Prow Robot
487cdf46c8
Merge pull request #138642 from stlaz/ensure-node-e2e-investigate
e2e node conformance: fix the EnsureCredentialPulledImages test flakes
2026-05-01 03:03:32 +05:30
Sotiris Salloumis
04aa178a5a Fix pull-kubernetes-node-kubelet-serial-containerd flaky tests
Improve testdeviceplugin to healthcheck kubelet and
fail early if kubelet is not healthy.

Check sampledeviceplugin pod logs, and perform manual registration
only after the container has entered the registration loop.

Add printouts of sampledeviceplugin pod after each device-plugin-test
test, for troubleshooting.

Fix the flaky failed-admission test in device_plugin_test by ensuring containers are stopped
and then checking that the number of device plugins is one before checking the containers'
matching devices.

Fix the Resources API SRIOV flaky test by cleaning up pods in BeforeEach.

Clean up pod-stress and memory-qos test pods in AfterEach.
2026-04-30 20:22:26 +02:00
Lukasz Wojciechowski
100c351b4b fix: use DeferCleanup for SRIOV resources in runTMScopeResourceAlignmentTestSuite
The test was calling teardownSRIOVConfigOrFail at the end of the function,
which meant resources would not be cleaned up if any test failed midway.
Using ginkgo.DeferCleanup ensures proper cleanup even on test failure.
2026-04-30 16:59:55 +02:00
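Note on 100c351b4b: the difference between a teardown call at the end of a function and a registered cleanup can be sketched in plain Go. This sketch uses the built-in defer statement as a stand-in for ginkgo.DeferCleanup and models a test failure as a panic. Both function names are hypothetical, and no Ginkgo API is used.

```go
package main

// runWithTrailingTeardown mimics the old shape: teardown is the last
// statement, so it is skipped when body fails (modeled here as a panic).
func runWithTrailingTeardown(body func(), teardown func()) (failed bool) {
	defer func() {
		if r := recover(); r != nil {
			failed = true
		}
	}()
	body()
	teardown() // never reached if body panics
	return false
}

// runWithDeferredTeardown registers teardown up front, so it runs
// whether or not body fails.
func runWithDeferredTeardown(body func(), teardown func()) (failed bool) {
	defer func() {
		if r := recover(); r != nil {
			failed = true
		}
	}()
	defer teardown() // runs first during unwinding, before the recover above
	body()
	return false
}
```

Calling both helpers with a body that panics shows the leak: only the deferred variant invokes teardown.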
Stanislav Láznička
a8ab1bc19c
e2e node conformance: restart the kubelet after removing the image_manager dir to recover it
Signed-off-by: Stanislav Láznička <slznika@microsoft.com>
2026-04-29 10:43:13 +02:00
Stanislav Láznička
cdfc943823
e2e node conformance: print error and directory listing on image record stat failure
Signed-off-by: Stanislav Láznička <slznika@microsoft.com>
2026-04-29 10:24:50 +02:00
zak905
04286814e7 clean up: remove loop variable capture
2026-04-28 23:53:27 +02:00
Shachar Tal
db2380e4c1
Apply suggestions from code review
Co-authored-by: Bing Hongtao <695097494plus@gmail.com>
2026-04-28 10:57:53 +03:00
Kubernetes Prow Robot
75d51c4407
Merge pull request #138258 from pohly/ktesting-cgo
ktesting dependencies
2026-04-27 06:48:46 +05:30
Kubernetes Prow Robot
e8cb34c6d8
Merge pull request #138322 from willie-yao/pid-flake
Fix flaky hostPID security context test by retrying nginx PID file read
2026-04-25 07:00:59 +05:30
Kubernetes Prow Robot
03153864cf
Merge pull request #137930 from rata/userns-idsPerPod-test-fixes
tests: Wait for pod to be removed on kubelet restarts with userns.idsPerPod
2026-04-25 07:00:52 +05:30
Kubernetes Prow Robot
9234064eda
Merge pull request #137627 from hoteye/pr-nodelease-graceful-shutdown-test
test/e2e_node: cover node lease renewal during graceful shutdown
2026-04-25 07:00:45 +05:30
Kubernetes Prow Robot
fea119171f
Merge pull request #138242 from pacoxu/sriov-test
remove flaky label in SRIOV related tests
2026-04-25 04:45:09 +05:30
Kubernetes Prow Robot
9af59744b6
Merge pull request #138200 from rpb-ant/rpb/podresources-skip-leak
e2e_node: podresources: skip cpuAlloc check in BeforeEach, not JustBeforeEach
2026-04-25 04:45:01 +05:30
Patrick Ohly
84190acdaa ktesting: move format package
The format package is used by ktesting, both to reconfigure Gomega and to
format errors. Because those are desirable features, it has to be moved to
staging together with ktesting, if or when we get to that.

Because format only has the YAML package as additional dependency and that
should be okay for all other repos (except for the YAML package itself, of
course), we can publish the format package as a sub-package of such a future
ktesting module.

Avoiding the dependency on apimachinery to detect unstructured.Unstructured is
a bit tricky, but doable by relaxing what we check for. The test/utils/format
package is kept to test ktesting/format with the actual packages that it cannot
depend on (apimachinery, api).
2026-04-24 21:54:19 +02:00
Kubernetes Prow Robot
c2b57ba319
Merge pull request #138135 from HirazawaUi/add-more-e2e-tests-for-kep-4781
Add e2e test to ensure that the NotReady pod status does not change after kubelet restart
2026-04-23 20:00:45 +05:30
Kubernetes Prow Robot
ce14ead9b2
Merge pull request #138253 from HirazawaUi/remove-duplicate-kubelet-health-checks
E2E_Node: Remove duplicate kubelet health checks
2026-04-23 09:36:51 +05:30
Kubernetes Prow Robot
679a271800
Merge pull request #138143 from dims/fix-cri-proxy-event-stream
test/e2e_node: fix CRI proxy event forwarding
2026-04-23 05:12:12 +05:30
William Yao
0068e4149c
Fix flaky hostPID security context test by retrying nginx PID file read
Signed-off-by: William Yao <william2000yao@gmail.com>
2026-04-22 10:30:56 -07:00
Qi Wang
2aaa5b654b skip MemoryQoS rollback test until implementation is resolved
Skip the MemoryQoS rollback test until we figure out the rollback mechanism.

Signed-off-by: Qi Wang <qiwan@redhat.com>
2026-04-20 12:41:45 -04:00
Shachar Tal
d7f380f7e3
kubelet: enforce ephemeral-storage limits on restartable init containers
containerEphemeralStorageLimitEviction() only iterated
pod.Spec.Containers when building the per-container ephemeral-storage
threshold map. Restartable init containers (sidecars) were never
checked against their declared limit, allowing them to exceed it
indefinitely without triggering eviction.

Include restartable init containers in the threshold map so the
existing per-container comparison covers them.
2026-04-19 15:48:51 +03:00
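Note on d7f380f7e3: the change can be pictured as widening the loop that builds the per-container limit map. The sketch below uses simplified, hypothetical types (container, podSpec) and helper names rather than the kubelet's real API types. Leaving non-restartable init containers out of the map is an assumption drawn from the wording of the commit message.

```go
package main

// container is a simplified, hypothetical stand-in for a pod container.
type container struct {
	name                  string
	ephemeralStorageLimit int64 // bytes; 0 means no limit declared
	restartable           bool  // true for restartable init containers (sidecars)
}

// podSpec is a simplified, hypothetical stand-in for a pod spec.
type podSpec struct {
	initContainers []container
	containers     []container
}

// ephemeralStorageLimits builds the per-container limit map. Regular
// containers were always included; the second loop adds restartable
// init containers, which is the gap the commit describes.
func ephemeralStorageLimits(spec podSpec) map[string]int64 {
	limits := make(map[string]int64)
	for _, c := range spec.containers {
		if c.ephemeralStorageLimit > 0 {
			limits[c.name] = c.ephemeralStorageLimit
		}
	}
	for _, c := range spec.initContainers {
		if c.restartable && c.ephemeralStorageLimit > 0 {
			limits[c.name] = c.ephemeralStorageLimit
		}
	}
	return limits
}

// exceedsLimit reports whether a container with a declared limit is over it.
func exceedsLimit(limits map[string]int64, name string, usedBytes int64) bool {
	limit, ok := limits[name]
	return ok && usedBytes > limit
}
```

Without the second loop, a sidecar never appears in the map, so exceedsLimit returns false for it no matter how much it writes.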
Dylan liu
796856658c test/e2e_node: cover node lease renewal during graceful shutdown
Add a dedicated graceful shutdown e2e_node case to verify that the node lease continues to renew while shutdown is active.

The test uses an extended shutdown window, configures the kubelet lease cadence explicitly, waits for the node to report Ready=False with reason KubeletNotReady, and then checks that the lease renewTime advances multiple times before shutdown completes.
2026-04-14 14:50:19 +08:00
Rodrigo Campos
a138a4825e tests: Wait for pod to be removed on kubelet restart with idsPerPod
The test starts the kubelet with a non-default setting for idsPerPod,
runs a pod, deletes it, and then restarts the kubelet.

The issue is that the kubelet guarantees that no two pods' userns
mappings overlap (for security reasons). But we are not waiting for the
pod to be removed: the deleteSync() call only waits for the API server
to remove the pod.

So, the pod is on disk (and maybe even running!) when we restart the
kubelet. As the previous configuration is incompatible with the new one
after restart if pods are running, the kubelet failing is the right
thing. We should just wait for the pod to be deleted from the kubelet
too, before restarting it with an incompatible configuration.

So, this commit just changes the pod deletion (previously done in
e2eoutput.TestContainerOutput(), which only waits for the API server) to wait
for the kubelet to delete the pod.

Signed-off-by: Rodrigo Campos <rodrigo@amutable.com>
2026-04-09 11:45:22 +02:00
Ryan Brewster
67ffec43e4
e2e_node: podresources: skip cpuAlloc check in BeforeEach, not JustBeforeEach
When the cpuAlloc check at podresources_test.go:1358 fires, it Skip()s
from a JustBeforeEach at the outer When() level. By that point the inner
When()'s tempSetCurrentKubeletConfig BeforeEach has already rewritten
the kubelet config (including, for the "restricted list output disabled"
block, setting KubeletPodResourcesListUseActivePods=false).

Ginkgo only runs AfterEach hooks at-or-shallower than the node where
Skip() fired (internal/group.go:252), so the inner AfterEach that would
restore the kubelet config is never invoked. The leaked feature gate
then propagates to every subsequent serial test, which breaks
device_plugin_test.go's "Does not keep device plugin assignments across
node reboots if fails admission" on e2-standard-2 nodes.

Moving the check to BeforeEach makes it fire before the inner
BeforeEach runs, so the config is never written. This matches the
identical check at podresources_test.go:1120.

Signed-off-by: Ryan Brewster <rpb@anthropic.com>
2026-04-08 17:19:15 +00:00
HirazawaUi
9f19fc42b5 Remove duplicate kubelet health checks
2026-04-07 22:58:37 +08:00
HirazawaUi
e59a6e7726 Add e2e test to ensure that the NotReady pod status does not change after kubelet restart
2026-04-07 22:13:08 +08:00
Paco Xu
287dbcf12a remove flaky label in SRIOV related tests
2026-04-07 09:30:26 +08:00
yashsingh74
afdb5e5d1f
Update CNI plugins to v1.9.1
Signed-off-by: yashsingh74 <yashsingh1774@gmail.com>
2026-04-01 14:06:34 +05:30
Davanum Srinivas
c2f0180463
test/e2e_node: fix CRI proxy event forwarding
The CRI proxy called GetContainerEvents synchronously, which blocked in
the upstream receive loop and prevented kubelet from receiving
container lifecycle events. With AllAlpha enabled, that breaks the
EventedPLEG path and leaves the restart and image-pull retry tests
dependent on delayed fallback relists.

Run the upstream event stream in a goroutine, tie it to the
downstream stream context, and propagate non-cancellation errors
after forwarding completes. Also restore the image-volume test to
look for the kubelet log message emitted when Image.Image is empty.
2026-03-31 18:44:22 -04:00
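Note on c2f0180463: the pattern described there (read the upstream stream in a goroutine, tie it to the downstream context, surface only non-cancellation errors) can be sketched with plain functions in place of the CRI streams. Everything named here (forwardEvents, demoForward, errUpstreamClosed, the string events) is hypothetical and only illustrates the shape of the fix, not the proxy's actual code.

```go
package main

import (
	"context"
	"errors"
)

// errUpstreamClosed is a hypothetical error used by the demo upstream.
var errUpstreamClosed = errors.New("upstream closed")

// forwardEvents reads events from recv in a goroutine and passes each one
// to send. It returns nil when ctx is done or the stream ends with a
// cancellation error, and returns any other error to the caller.
func forwardEvents(ctx context.Context, recv func(context.Context) (string, error), send func(string) error) error {
	errCh := make(chan error, 1) // buffered so the goroutine never blocks on exit
	go func() {
		for {
			ev, err := recv(ctx)
			if err == nil {
				err = send(ev)
			}
			if err != nil {
				errCh <- err
				return
			}
		}
	}()
	select {
	case <-ctx.Done():
		return nil // downstream went away; not an error
	case err := <-errCh:
		if errors.Is(err, context.Canceled) {
			return nil
		}
		return err
	}
}

// demoForward forwards two fake events and then hits a non-cancellation error.
func demoForward() ([]string, error) {
	upstream := []string{"created", "started"}
	var forwarded []string
	recv := func(context.Context) (string, error) {
		if len(upstream) == 0 {
			return "", errUpstreamClosed
		}
		ev := upstream[0]
		upstream = upstream[1:]
		return ev, nil
	}
	send := func(ev string) error {
		forwarded = append(forwarded, ev)
		return nil
	}
	err := forwardEvents(context.Background(), recv, send)
	return forwarded, err
}
```

In demoForward both events reach the downstream side before the upstream error is propagated, which mirrors the "propagate after forwarding completes" ordering in the commit message.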
Davanum Srinivas
10efa46fbb
e2e_node: wait for pod drain before asserting zero pods in Memory Manager Metrics
The Memory Manager Metrics BeforeEach asserts that zero pods are
running on the node after a kubelet config update. This hard assertion
flakes when a preceding serial test's namespace deletion hasn't
completed yet — framework namespace cleanup is async and the kubelet
restart in updateKubeletConfig can delay in-flight pod termination.

CI logs show leftover pods from MemoryQoS tests (memqos-burstable,
memqos-no-limit, etc.), Probe Stress tests (50-container pods), and
Summary API PSI tests (memory-pressure-pod), all still Running when
the assertion fires 4-7ms after the previous test finishes.

Replace the immediate Expect(count).To(BeZero()) with an Eventually
poll (2 minute timeout, 5 second interval) that gives pods time to
drain after the kubelet restart. The existing printAllPodsOnNode
diagnostic output is preserved inside the poll for debugging.

Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2026-03-28 15:27:25 -04:00
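Note on 10efa46fbb: replacing an immediate assertion with a poll can be sketched using only the standard library. This is a stand-in for Gomega's Eventually, not its implementation. The names waitForZero and drainingNode are hypothetical, and the fake node simply loses one pod per poll.

```go
package main

import (
	"fmt"
	"time"
)

// waitForZero polls count until it returns zero or the timeout elapses.
// An immediate assertion corresponds to calling count once and failing
// on any nonzero value.
func waitForZero(count func() int, timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		n := count()
		if n == 0 {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("still %d pods running after %s", n, timeout)
		}
		time.Sleep(interval)
	}
}

// drainingNode returns a fake pod counter that reports one pod fewer on
// each call, modeling pods that finish terminating over time.
func drainingNode(pods int) func() int {
	return func() int {
		n := pods
		if pods > 0 {
			pods--
		}
		return n
	}
}
```

With a counter that starts at three pods, an immediate check fails, while waitForZero succeeds after a few polls. The commit uses a 2 minute timeout and a 5 second interval; much shorter values are enough for a quick demonstration.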
Kubernetes Prow Robot
c6a95ffd4c
Merge pull request #137996 from pacoxu/inplace-disable
set InPlacePodLevelResourcesVerticalScaling to false if needed
2026-03-28 08:42:11 +05:30
Kubernetes Prow Robot
473b7635de
Merge pull request #138006 from tallclair/push-kooxxktxovkr
Flaky test fix for 'should restart failing container when pod restartPolicy is Always'
2026-03-25 02:18:16 +05:30
Xinyun Liu
990b72c522 Address comments and add more e2e tests
2026-03-24 17:52:45 +00:00