Drops f.WithFlaky() from two test blocks where the tag has become stale:
- [sig-node] kubelet host cleanup with volume mounts [HostCleanup]
(covers both NFS sub-tests: active and sleeping client pods)
- [sig-storage] PersistentVolumes-local "should set different fsGroup
for second pod if first pod is deleted" (covers all 8 volume-type
variants from the parameterized parent)
Testgrid evidence -- both dashboards show consistent passes across all
30 recent runs:
https://testgrid.k8s.io/google-gce#gci-gce-flaky&include-filter-by-regex=Flaky
https://testgrid.k8s.io/sig-testing-misc#gce-cos-master-flaky-repro&include-filter-by-regex=Flaky
History:
- HostCleanup was tagged [Flaky] in PR 41659 (merged 2017-04-13) as a
quick workaround for parallel-execution interference with disruptive
tests; the follow-up "remove [Flaky]" PR mentioned in that body never
landed. Root-cause issue 31272 ("Hung volumes can wedge the kubelet")
remains open.
- fsGroup test was tagged [Flaky] in PR 75015 (merged 2019-03-06) to
skip a race in DesiredStateOfWorld re-adding terminating-pod volumes.
Root-cause issue 73168 ("Do not remount volume again after it is
detached") remains open. The obsolete TODO comment referencing that
issue is also removed.
If either test regresses, the safe rollback is to restore f.WithFlaky()
and reopen the conversation on issue 31272 / 73168.
Document why cuda-samples is pinned to v12.5 rather than the latest
tag: it has to match the CUDA 12.5 toolkit in the base image and the
cuda-demo-suite-12-5 apt package used on x86_64. v13+ cuda-samples
also requires CUDA Toolkit 13.x and switched from make to CMake, so
bumping is a coordinated change across base image, apt package, git
tag, and build commands.
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
The [Feature:GPUDevicePlugin] Sanity test embeds
`apt-get install -y cuda-demo-suite-12-5` under `set -e`. NVIDIA's CUDA
apt repo publishes cuda-demo-suite-* for x86_64 but NOT for sbsa
(confirmed against the public Packages index on
developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/{sbsa,x86_64}/).
On arm64 the install fails, the container exits 1, pod.Status.Phase
becomes Failed, and the subsequent `gomega.Expect(... .Equal(Succeeded))`
assertion trips.
Split the demo phase on architecture. On x86_64 keep the existing apt
path unchanged. On anything else, build deviceQuery / vectorAdd /
bandwidthTest from the public NVIDIA/cuda-samples repo instead.
busGrind is exclusive to cuda-demo-suite (no source equivalent in
cuda-samples) and is skipped on non-x86_64.
The pattern is the one already in production use by
sigs.k8s.io/dra-driver-nvidia-gpu in tests/bats/specs/gpu-cuda-demo-suite.yaml,
which has been green on Lambda gpu_1x_gh200.
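The architecture split described above can be sketched in Go as follows.
This is a minimal illustration, not the code in the test: demoPlan and
planDemos are hypothetical names, and the assumption that the x86_64 path
runs the same three demos plus busGrind is inferred from the description.

```go
package main

// demoPlan is a hypothetical description of how the demo phase runs.
type demoPlan struct {
	// fromApt is true when the demos come from the cuda-demo-suite-12-5
	// apt package, false when they are built from NVIDIA/cuda-samples.
	fromApt bool
	// demos lists the demo binaries to run.
	demos []string
}

// planDemos picks the demo source for a machine architecture as reported
// by `uname -m`. Only x86_64 has the apt package, so every other
// architecture falls back to a source build and skips busGrind.
func planDemos(arch string) demoPlan {
	common := []string{"deviceQuery", "vectorAdd", "bandwidthTest"}
	if arch == "x86_64" {
		return demoPlan{fromApt: true, demos: append(common, "busGrind")}
	}
	return demoPlan{fromApt: false, demos: common}
}
```

Keying on "x86_64 versus everything else" rather than on "arm64" matches
the wording above: any architecture without the apt package takes the
source-build path.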
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
The fast-delete pod status tests currently require the intentionally failing
"fail" container to report exit code 1. In CI, some runtimes occasionally
report exit code 2 with reason=Error even though the tested invariant still
holds: the container failed and the blocked workload container never started.
The latest dims/test-k8s failure on master showed exactly that state: the pod
remained Failed, Initialized=False, the blocked container reported
started=false, and only the failing init container drifted from exit 1 to exit
2. This matches kubernetes/kubernetes issue 135713 and the related
pending-container history in PR 131605.
Accept exit code 2 in this verifier so the test continues to assert the
behavior it is meant to cover instead of a lower-layer exit-code detail.
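A minimal sketch of the relaxed check, assuming the verifier only needs to
confirm that the container failed with one of the two observed codes.
verifyFailExitCode is a hypothetical helper name, not the function in the
test:

```go
package main

import "fmt"

// verifyFailExitCode accepts the exit codes observed for the intentionally
// failing container: 1 (expected) and 2 (occasionally reported by some
// runtimes in CI). Any other code, including 0, is rejected.
func verifyFailExitCode(exitCode int32) error {
	switch exitCode {
	case 1, 2:
		return nil
	default:
		return fmt.Errorf("unexpected exit code %d, want 1 or 2", exitCode)
	}
}
```

An explicit allow-list keeps the check strict: a container that exits 0
or is killed with another code still fails verification.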
Fixes issue 135713
Tested:
- hack/verify-gofmt.sh
- hack/verify-test-code.sh
- hack/verify-typecheck.sh ./test/e2e/node/...
- go test ./test/e2e/node -run TestNonExistent -count=1
Co-authored-by: Jordan Liggitt <jordan@liggitt.net>
Add a new ALPHA stability metric terminated_containers_total to track
container terminations (both successful and failed). This metric provides
aggregate visibility into container exit patterns across the node,
supporting detection of abnormal exits (e.g., SIGSEGV, OOMKilled) and
enabling error-rate calculations.
To ensure node stability and comply with Kubernetes instrumentation
standards, the metric uses the following low-cardinality labels:
- container_type (container, init_container, or ephemeral_container)
- exit_code (the literal exit status)
- reason (the termination reason from the runtime)
High-cardinality labels (container_name, namespace_name) are deliberately
omitted to prevent metric cardinality explosion. Problematic containers can
be identified via standard troubleshooting workflows using Kubernetes
Events or API status.
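The effect of the label choice can be illustrated with a standard-library
stand-in for a labeled counter. This is a sketch only: terminationCounter
and its methods are hypothetical and are not the kubelet's metrics code,
which registers the metric in metrics.go.

```go
package main

import (
	"strconv"
	"sync"
)

// terminationLabels holds the three low-cardinality labels. Container name
// and namespace are deliberately absent.
type terminationLabels struct {
	containerType string
	exitCode      string
	reason        string
}

// terminationCounter is a hypothetical in-memory counter keyed by labels.
type terminationCounter struct {
	mu     sync.Mutex
	counts map[terminationLabels]uint64
}

func newTerminationCounter() *terminationCounter {
	return &terminationCounter{counts: make(map[terminationLabels]uint64)}
}

// record increments the series for one container termination.
func (c *terminationCounter) record(containerType string, exitCode int, reason string) {
	key := terminationLabels{containerType, strconv.Itoa(exitCode), reason}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[key]++
}

// value returns the current count for one label combination.
func (c *terminationCounter) value(containerType string, exitCode int, reason string) uint64 {
	key := terminationLabels{containerType, strconv.Itoa(exitCode), reason}
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[key]
}

// seriesCount returns how many distinct label combinations exist.
func (c *terminationCounter) seriesCount() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return len(c.counts)
}
```

Two different containers that both exit with code 139 and the same reason
land in the same series, so the number of series grows with the number of
distinct exit patterns rather than with the number of containers.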
Included:
- Metric definition and registration in metrics.go.
- Status manager implementation to record transitions exactly once.
- Unit tests in status_manager_test.go verifying success/failure logic.
- Node e2e test to verify correct metrics exposure.
generation test into pod_admission.go and commenting out
PodReadyToStartContainers. Conformance promotion will follow in a
separate PR once this lands green, per review feedback.
The test uses an invalid image to induce a pull error. The previous image
name 'some-image-that-doesnt-exist' causes slow DNS/registry resolution
on some environments (especially metal), leading to 30s timeouts.
Using 'localhost/some-image-that-does-not-exist' makes the pull fail
instantly since there is no registry on localhost, avoiding flaky
timeouts.
The spaces are unnecessary because Ginkgo adds spaces automatically.
Previously this was detected only for tests using the wrapper functions;
now it is also detected for Ginkgo methods.
A couple of tests were recently promoted to conformance
without declaring a minimum kubelet version,
which broke the kubeadm/kinder e2e jobs that skew the kubelet
version against the apiserver version.
This doubles the termination timeout for the eviction test from 5min to
10min. The eviction manager relies on pod stats metrics, which may be
inaccessible for a period of time while the kubelet API is unreachable.
This can be caused by hardware or network pressure when multiple tests
run in parallel.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>