Added a test verifying that when both a device plugin and a DRA
driver advertise the same resource on one node, the device plugin
wins (filterExtendedResources takes the DRA path only when
allocatable == 0).
Asserted ExtendedResourceClaimStatus in the existing "process
extended resources after device plugin uninstall" test to confirm
the DRA path is taken after DP removal.
Drops f.WithFlaky() from two test blocks where the tag has become stale:
- [sig-node] kubelet host cleanup with volume mounts [HostCleanup]
  (covers both NFS sub-tests: active and sleeping client pods)
- [sig-storage] PersistentVolumes-local "should set different fsGroup
  for second pod if first pod is deleted" (covers all 8 volume-type
  variants from the parameterized parent)
Testgrid evidence -- both dashboards show consistent passes across all
30 recent runs:
https://testgrid.k8s.io/google-gce#gci-gce-flaky&include-filter-by-regex=Flaky
https://testgrid.k8s.io/sig-testing-misc#gce-cos-master-flaky-repro&include-filter-by-regex=Flaky
History:
- HostCleanup was tagged [Flaky] in PR 41659 (merged 2017-04-13) as a
  quick workaround for parallel-execution interference with disruptive
  tests; the follow-up "remove [Flaky]" PR mentioned in that body never
  landed. Root-cause issue 31272 ("Hung volumes can wedge the kubelet")
  remains open.
- fsGroup test was tagged [Flaky] in PR 75015 (merged 2019-03-06) to
  skip a race in DesiredStateOfWorld re-adding terminating-pod volumes.
  Root-cause issue 73168 ("Do not remount volume again after it is
  detached") remains open. The obsolete TODO comment referencing that
  issue is also removed.
If either test regresses, the safe rollback is to restore f.WithFlaky()
and reopen the conversation on issue 31272 / 73168.
Two Windows e2e tests, Memory Limits and Kubelet-Stats, compute
"schedulable memory" directly from node.Status.Allocatable.Memory and
ignore pod.Spec.Overhead. That is inconsistent with how the kubelet
admits pods: admission-time accounting sums each pod's container
requests plus pod.Spec.Overhead (Pod Overhead, KEP-688, GA in 1.24).
On clusters whose admission chain injects per-pod overhead, for
example a cluster with a RuntimeClass whose scheduling overhead is
non-zero, or a mutating webhook that sets Spec.Overhead, these tests
overschedule the node and fail with OutOfmemory admission errors. On
clusters with no overhead the tests behave the same as before.
Add three helpers in test/e2e/windows/utils.go:
- detectPodOverheadMemory(ctx, c, namespace) (int64, error): performs
  a single DryRun pod create and inspects the mutated result for
  Spec.Overhead[ResourceMemory]. Result and error are cached for the
  lifetime of the test process via sync.Once. DryRun is the right
  primitive because admission webhooks may inject overhead
  conditionally on namespace, labels, or other request-scoped data
  that is not visible from a static read of the RuntimeClass API.
- sumExistingPodMemoryReservation(ctx, c, nodeName): sums per-pod
  container requests + Spec.Overhead for non-terminal pods on a
  node. Used to leave room for DaemonSets and system pods.
- waitForNodeMemoryToSettle(ctx, c, nodeName, neededBytes): polls
  until enough memory frees up after a previous [Serial] test;
  on timeout logs a tagged "did NOT settle" message but does not
  fail the test.
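For illustration only, the caching and summing behavior described for the
first two helpers could be sketched roughly as follows. This is not the
code in test/e2e/windows/utils.go: podMemory, overheadCache, and the probe
callback are hypothetical stand-ins that replace the API client and v1.Pod
so the sketch runs without a cluster.

```go
package main

import (
	"fmt"
	"sync"
)

// podMemory is a hypothetical, simplified stand-in for the pod fields that
// matter here; the real helpers work on pods fetched from the API server.
type podMemory struct {
	ContainerRequests []int64 // per-container memory requests, in bytes
	Overhead          int64   // Spec.Overhead memory in bytes, 0 if unset
	Terminal          bool    // true for pods that have already finished
}

// sumMemoryReservation adds container requests plus overhead for every
// non-terminal pod, following the description of
// sumExistingPodMemoryReservation.
func sumMemoryReservation(pods []podMemory) int64 {
	var total int64
	for _, p := range pods {
		if p.Terminal {
			continue // finished pods no longer hold a reservation
		}
		for _, r := range p.ContainerRequests {
			total += r
		}
		total += p.Overhead
	}
	return total
}

// overheadCache memoizes a single probe result together with its error.
type overheadCache struct {
	once  sync.Once
	bytes int64
	err   error
}

// get runs probe at most once per cache; later calls return the stored
// result and error without calling probe again.
func (c *overheadCache) get(probe func() (int64, error)) (int64, error) {
	c.once.Do(func() {
		c.bytes, c.err = probe()
	})
	return c.bytes, c.err
}

func main() {
	var cache overheadCache
	calls := 0
	// Assumption: the probe stands in for the DryRun pod create.
	probe := func() (int64, error) {
		calls++
		return 64 << 20, nil // pretend admission injected 64 MiB
	}
	first, _ := cache.get(probe)
	second, _ := cache.get(probe)
	fmt.Println(first == second, calls)

	pods := []podMemory{
		{ContainerRequests: []int64{100, 50}, Overhead: 10},
		{ContainerRequests: []int64{500}, Terminal: true},
	}
	fmt.Println(sumMemoryReservation(pods))
}
```

Caching the error along with the value means a failed probe is not retried
within the same process, which matches the "result and error are cached"
wording above.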
Adopt the helpers in:
- memory_limits.go: subtract overhead + existing reservation +
  safety buffer (256 MiB) from Allocatable.Memory when sizing the
  consume pod, instead of subtracting a hard-coded 100 MiB.
- kubelet_stats.go (10-pod test): compute maxPods = (allocatable -
  existing - safetyBuffer) / overhead, lower numPods accordingly,
  and skip cleanly when fewer than 3 pods can fit.
- kubelet_stats.go (3-pod test): apply the same
  skip-on-insufficient-room logic.
Behavior on clusters without Pod Overhead is byte-for-byte
equivalent: the helpers short-circuit and the existing per-test
code paths are unchanged.
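The sizing arithmetic above can be sketched as plain integer math. This is
a minimal illustration, not the code in memory_limits.go or
kubelet_stats.go: the function names and the minPods constant are
hypothetical, and only the formulas and the 256 MiB buffer come from the
description.

```go
package main

import "fmt"

const (
	safetyBuffer int64 = 256 << 20 // 256 MiB safety buffer
	minPods      int64 = 3         // below this the test is skipped
)

// consumePodBytes returns allocatable minus overhead, existing
// reservation, and safety buffer, clamped at zero. Only the
// overhead-aware path is modelled here.
func consumePodBytes(allocatable, overhead, existing int64) int64 {
	room := allocatable - overhead - existing - safetyBuffer
	if room < 0 {
		return 0
	}
	return room
}

// planNumPods applies maxPods = (allocatable - existing - safetyBuffer) /
// overhead. With zero overhead it short-circuits and keeps the requested
// count, so clusters without Pod Overhead see no change.
func planNumPods(requested, allocatable, existing, overhead int64) (numPods int64, skip bool) {
	if overhead == 0 {
		return requested, false
	}
	maxPods := (allocatable - existing - safetyBuffer) / overhead
	if maxPods < minPods {
		return 0, true // not enough room: skip instead of failing
	}
	if maxPods < requested {
		return maxPods, false
	}
	return requested, false
}

func main() {
	const mib int64 = 1 << 20
	// Example numbers are made up: 4 GiB allocatable, 1 GiB already
	// reserved, 512 MiB overhead per pod.
	fmt.Println(consumePodBytes(4096*mib, 512*mib, 1024*mib) / mib)
	fmt.Println(planNumPods(10, 4096*mib, 1024*mib, 512*mib))
	fmt.Println(planNumPods(10, 4096*mib, 1024*mib, 0))
}
```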
Make the process in the container more CPU-intensive so that the test
reliably observes CPU usage of more than a nanocore within the test window.
This works around a known limitation in older containerd versions.
Raise the UsageNanoCores and UsageCoreNanoSeconds bounds to accommodate
the additional CPU load.
The GCE node image family was updated to cos-125-lts but the
nvidia-driver-installer DaemonSet image was never bumped to match.
cos-gpu-installer:v2.5.7 is only suitable for COS M121; it crashes
(CrashLoopBackOff) on cos-125-19216-220-150 nodes, blocking GPU driver
installation and causing all GPU e2e tests to time out.
Bump to v2.5.8, the first release in the COS M125 release notes:
https://cloud.google.com/container-optimized-os/docs/release-notes/m125
75448c416b added feature gate dependencies at the end of a test
name. However, if those tags were already part of the previous text, either
because they were explicitly added in the current node or in some parent node,
then redundant tags were added.
Now this special case is detected and such redundant tags do not get added
again.
This shouldn't substantially change which tests run in jobs (an on-by-default
beta feature can only depend on other on-by-default features, for example), but
it makes the FeatureGate list in the test name more complete.
The additional feature gate names are treated like additional metadata and get
added at the end of the full test name.
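As a rough sketch of the deduplication idea (not the framework's actual
implementation), the check could look like the following. The function
name appendMissingTags, the substring test, and the "[FeatureGate:...]"
tag spelling with made-up gate names are assumptions for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// appendMissingTags appends each tag to the end of name unless the text
// built so far already contains it, whether it came from the current node,
// a parent node, or an earlier entry in tags.
func appendMissingTags(name string, tags []string) string {
	for _, tag := range tags {
		if strings.Contains(name, tag) {
			continue // already present: adding it again would be redundant
		}
		name += " " + tag
	}
	return name
}

func main() {
	// Hypothetical test name and gate names.
	name := "[sig-node] pods [FeatureGate:GateA] should start"
	deps := []string{"[FeatureGate:GateA]", "[FeatureGate:GateB]"}
	fmt.Println(appendMissingTags(name, deps))
}
```

Matching on the full bracketed tag rather than the bare gate name keeps a
gate from being treated as present just because its name is a prefix of
another gate's name.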