After the command was replaced to increase UsageNanoCores and fix a previously
flaky test, UsageNanoCores exceeds the 2e+09 limit in some test environments.
This commit attempts to fix this by increasing the UsageNanoCores limit to 2e+10.
Replace the small echo write with a dd that uses conv=fsync to force
data through the block layer. Without fsync, the 11-byte echo writes
stay in page cache and never reach the block device within the
60-second test window. This leaves the cgroup io.stat empty, so
cadvisor does not emit container_blkio_device_usage_total,
container_fs_reads_bytes_total, or container_fs_writes_bytes_total
for the container.
The conv=fsync call guarantees block device I/O on every loop
iteration. Once io.stat has an entry for a device, all fields
(rbytes, wbytes, rios, wios) are present, even if zero, so all
cadvisor metrics pass their boundedSample(0, ...) checks.
Also increase the UsageCoreNanoSeconds upper bound from 1e11 to 1e12
for the container and pod-level CPU checks. The cumulative CPU time
can exceed 100s on slower architectures like ppc64le, where the dd
CPU burner loop accumulates CPU time faster than expected.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
One podresources test was not waiting for Pod Resources V1 to be serving.
This can lead to flaky tests in a later step.
This change attempts to fix this flaky test by adding waitForPodResourcesV1Serving(ctx),
as done in the remaining tests. In addition, ExpectNoError was added to all connection-closing
attempts to improve troubleshooting.
The openssl speed command added in #138423 causes the ContainerMetrics
cadvisor test to fail on CRI-O conformance (ci-node-crio-conformance)
by exceeding upper bounds for container_fs_writes_total,
container_blkio_device_usage_total and container_memory_failures_total.
Replace with a lightweight dd-based CPU burner that generates CPU load
via syscall overhead without filesystem I/O or memory side effects.
Revert the bound changes to pre-#138423 values. The underlying
UsageNanoCores issue is better addressed at the kubelet level
by #138687.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
Make the process in the container more CPU intensive to make sure
we catch CPU usage of more than a nanocore within the test window, to overcome
a known limitation in older containerd versions.
Increase the UsageNanoCores and UsageCoreNanoSeconds bounds to cater for
the additional CPU load.
Improve testdeviceplugin to healthcheck kubelet and
fail early if kubelet is not healthy.
Check sampledeviceplugin pod logs, and perform manual registration
only after the container has entered the registration loop.
Add printouts of sampledeviceplugin pod after each device-plugin-test
test, for troubleshooting.
Fix the flaky failed-admission test in device_plugin_test by ensuring containers are stopped,
and then checking that the number of device plugins is one before checking the containers'
matching devices.
Fix the flaky Resources API SRIOV test by cleaning up pods before each test step.
Clean up pod-stress and memory-qos test pods after each test step.
The test was calling teardownSRIOVConfigOrFail at the end of the function,
which meant resources would not be cleaned up if any test failed midway.
Using ginkgo.DeferCleanup ensures proper cleanup even on test failure.
The format package is used by ktesting, both to reconfigure Gomega and to
format errors. Because those are desirable features, it has to be moved to
staging together with ktesting, if and when we get to that.
Because format only has the YAML package as an additional dependency and that
should be okay for all other repos (except for the YAML package itself, of
course), we can publish the format package as a sub-package of such a future
ktesting module.
Avoiding the dependency on apimachinery to detect unstructured.Unstructured is
a bit tricky, but doable by relaxing what we check for. The test/utils/format
package is kept to test ktesting/format with the actual packages that it cannot
depend on (apimachinery, api).
containerEphemeralStorageLimitEviction() only iterated
pod.Spec.Containers when building the per-container ephemeral-storage
threshold map. Restartable init containers (sidecars) were never
checked against their declared limit, allowing them to exceed it
indefinitely without triggering eviction.
Include restartable init containers in the threshold map so the
existing per-container comparison covers them.
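A minimal sketch of the idea, using simplified stand-in types rather than the real Kubernetes API types or the kubelet's actual code. The type, field, and function names are all hypothetical; restartAlways stands for an init container whose restart policy makes it a sidecar.

```go
package main

import "fmt"

// container and podSpec are simplified stand-ins for the Kubernetes API
// types; they are not the real definitions.
type container struct {
	name                  string
	restartAlways         bool  // true for a restartable init container (sidecar)
	ephemeralStorageLimit int64 // bytes; 0 means no limit declared
}

type podSpec struct {
	initContainers []container
	containers     []container
}

// buildThresholds returns the declared ephemeral-storage limit per container,
// covering regular containers and restartable init containers.
func buildThresholds(spec podSpec) map[string]int64 {
	thresholds := make(map[string]int64)
	for _, c := range spec.containers {
		if c.ephemeralStorageLimit > 0 {
			thresholds[c.name] = c.ephemeralStorageLimit
		}
	}
	for _, c := range spec.initContainers {
		// Ordinary init containers are skipped; only sidecars are added.
		if c.restartAlways && c.ephemeralStorageLimit > 0 {
			thresholds[c.name] = c.ephemeralStorageLimit
		}
	}
	return thresholds
}

func main() {
	spec := podSpec{
		initContainers: []container{
			{name: "setup", ephemeralStorageLimit: 512},
			{name: "sidecar", restartAlways: true, ephemeralStorageLimit: 1024},
		},
		containers: []container{
			{name: "app", ephemeralStorageLimit: 2048},
		},
	}
	fmt.Println(buildThresholds(spec))
}
```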
Add a dedicated graceful shutdown e2e_node case to verify that the node lease continues to renew while shutdown is active.
The test uses an extended shutdown window, configures the kubelet lease cadence explicitly, waits for the node to report Ready=False with reason KubeletNotReady, and then checks that the lease renewTime advances multiple times before shutdown completes.
The test starts the kubelet with a non-default setting for idsPerPod,
runs a pod, deletes it, and then restarts the kubelet.
The issue is that the kubelet guarantees that no two pods' userns
mappings overlap (for security reasons). But we are not waiting for the
pod to be removed from the node: the deleteSync() call only waits for the
API server to remove the pod.
So, the pod is on disk (and maybe even running!) when we restart the
kubelet. As the previous configuration is incompatible with the new one
after restart if pods are running, the kubelet failing is the right
thing. We should just wait for the pod to be deleted from the kubelet
too, before restarting it with an incompatible configuration.
So, this commit just changes the pod deletion (previously done in
e2eoutput.TestContainerOutput(), which only waits for the API server) to wait
for the kubelet to delete the pod.
Signed-off-by: Rodrigo Campos <rodrigo@amutable.com>
When the cpuAlloc check at podresources_test.go:1358 fires, it Skip()s
from a JustBeforeEach at the outer When() level. By that point the inner
When()'s tempSetCurrentKubeletConfig BeforeEach has already rewritten
the kubelet config (including, for the "restricted list output disabled"
block, setting KubeletPodResourcesListUseActivePods=false).
Ginkgo only runs AfterEach hooks at-or-shallower than the node where
Skip() fired (internal/group.go:252), so the inner AfterEach that would
restore the kubelet config is never invoked. The leaked feature gate
then propagates to every subsequent serial test, which breaks
device_plugin_test.go's "Does not keep device plugin assignments across
node reboots if fails admission" on e2-standard-2 nodes.
Moving the check to BeforeEach makes it fire before the inner
BeforeEach runs, so the config is never written. This matches the
identical check at podresources_test.go:1120.
Signed-off-by: Ryan Brewster <rpb@anthropic.com>
The CRI proxy called GetContainerEvents synchronously, which blocked in
the upstream receive loop and prevented kubelet from receiving
container lifecycle events. With AllAlpha enabled, that breaks the
EventedPLEG path and leaves the restart and image-pull retry tests
dependent on delayed fallback relists.
Run the upstream event stream in a goroutine, tie it to the
downstream stream context, and propagate non-cancellation errors
after forwarding completes. Also restore the image-volume test to
look for the kubelet log message emitted when Image.Image is empty.
The Memory Manager Metrics BeforeEach asserts that zero pods are
running on the node after a kubelet config update. This hard assertion
flakes when a preceding serial test's namespace deletion hasn't
completed yet — framework namespace cleanup is async and the kubelet
restart in updateKubeletConfig can delay in-flight pod termination.
CI logs show leftover pods from MemoryQoS tests (memqos-burstable,
memqos-no-limit, etc.), Probe Stress tests (50-container pods), and
Summary API PSI tests (memory-pressure-pod), all still Running when
the assertion fires 4-7ms after the previous test finishes.
Replace the immediate Expect(count).To(BeZero()) with an Eventually
poll (2 minute timeout, 5 second interval) that gives pods time to
drain after the kubelet restart. The existing printAllPodsOnNode
diagnostic output is preserved inside the poll for debugging.
Signed-off-by: Davanum Srinivas <davanum@gmail.com>