Commit graph

505 commits

Author SHA1 Message Date
Jefftree
7fe9bbb5c5 e2e: skip HostCleanup test when worker has no NodeExternalIP 2026-05-15 15:45:52 -04:00
Davanum Srinivas
1f59ea104a
Remove [Flaky] for green tests
Drops f.WithFlaky() from two test blocks where the tag has become stale:

- [sig-node] kubelet host cleanup with volume mounts [HostCleanup]
  (covers both NFS sub-tests: active and sleeping client pods)
- [sig-storage] PersistentVolumes-local "should set different fsGroup
  for second pod if first pod is deleted" (covers all 8 volume-type
  variants from the parameterized parent)

Testgrid evidence -- both dashboards show consistent passes across all
30 recent runs:
  https://testgrid.k8s.io/google-gce#gci-gce-flaky&include-filter-by-regex=Flaky
  https://testgrid.k8s.io/sig-testing-misc#gce-cos-master-flaky-repro&include-filter-by-regex=Flaky

History:
- HostCleanup was tagged [Flaky] in PR 41659 (merged 2017-04-13) as a
  quick workaround for parallel-execution interference with disruptive
  tests; the follow-up "remove [Flaky]" PR mentioned in that body never
  landed. Root-cause issue 31272 ("Hung volumes can wedge the kubelet")
  remains open.
- fsGroup test was tagged [Flaky] in PR 75015 (merged 2019-03-06) to
  skip a race in DesiredStateOfWorld re-adding terminating-pod volumes.
  Root-cause issue 73168 ("Do not remount volume again after it is
  detached") remains open. The obsolete TODO comment referencing that
  issue is also removed.

If either test regresses, the safe rollback is to restore f.WithFlaky()
and reopen the conversation on issue 31272 / 73168.
2026-05-11 08:26:29 -04:00
zak905
04286814e7 clean up: remove loop variable capture 2026-04-28 23:53:27 +02:00
Kubernetes Prow Robot
ff06de939d
Merge pull request #134950 from Karthik-K-N/fix-inplace-flake
[Flaking test] [InPlacePodVerticalScaling] Fix Pod Resize deferred tests
2026-04-25 11:12:46 +05:30
Davanum Srinivas
0934916b90
test/e2e/node: explain v12.5 pin for cuda-samples on arm64
Document why cuda-samples is pinned to v12.5 rather than the latest
tag: it has to match the CUDA 12.5 toolkit in the base image and the
cuda-demo-suite-12-5 apt package used on x86_64. v13+ cuda-samples
also requires CUDA Toolkit 13.x and switched from make to CMake, so
bumping is a coordinated change across base image, apt package, git
tag, and build commands.

Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2026-04-20 07:07:50 -04:00
Davanum Srinivas
6db917c42e
Update test/e2e/node/gpu.go
Co-authored-by: Ed Bartosh <eduard.bartosh@intel.com>
2026-04-20 07:00:54 -04:00
Davanum Srinivas
ad41961d32
test/e2e/node: make GPU sanity test work on arm64 (sbsa)
The [Feature:GPUDevicePlugin] Sanity test embeds
`apt-get install -y cuda-demo-suite-12-5` under `set -e`. NVIDIA's CUDA
apt repo publishes cuda-demo-suite-* for x86_64 but NOT for sbsa
(confirmed against the public Packages index on
developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/{sbsa,x86_64}/).
On arm64 the install fails, the container exits 1, pod.Status.Phase
becomes Failed, and the subsequent `gomega.Expect(... .Equal(Succeeded))`
assertion trips.

Split the demo phase on architecture. On x86_64 keep the existing apt
path unchanged. On anything else, build deviceQuery / vectorAdd /
bandwidthTest from the public NVIDIA/cuda-samples repo instead.
busGrind is exclusive to cuda-demo-suite (no source equivalent in
cuda-samples) and is skipped on non-x86_64.

The pattern is the one already in production use by
sigs.k8s.io/dra-driver-nvidia-gpu in tests/bats/specs/gpu-cuda-demo-suite.yaml,
which has been green on Lambda gpu_1x_gh200.

Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2026-04-19 18:55:08 -04:00
Kubernetes Prow Robot
0e71d2d28f
Merge pull request #137749 from dims/dsrinivas/issue-135713-pod-status-exit-2
test/e2e/node: tolerate exit code 2 in pod status flake
2026-03-21 23:28:24 +05:30
Davanum Srinivas
d4181e8c20
test/e2e/node: tolerate exit code 2 in pod status flake
The fast-delete pod status tests currently require the intentionally failing
"fail" container to report exit code 1. In CI, some runtimes occasionally
report exit code 2 with reason=Error even though the tested invariant still
holds: the container failed and the blocked workload container never started.

The latest dims/test-k8s failure on master showed exactly that state: the pod
remained Failed, Initialized=False, the blocked container reported
started=false, and only the failing init container drifted from exit 1 to exit
2. This matches kubernetes/kubernetes issue 135713 and the related
pending-container history in PR 131605.

Accept exit code 2 in this verifier so the test continues to assert the
behavior it is meant to cover instead of a lower-layer exit-code detail.
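
A minimal sketch of the relaxed check, assuming the verifier only needs to decide whether a terminated container's exit code counts as the expected intentional failure. The function name `acceptableFailure` is hypothetical and is not the helper used in the test; only the accepted codes (1 and 2) come from the description above.

```go
package main

import "fmt"

// acceptableFailure reports whether exitCode counts as the intentional failure
// of the "fail" container. Hypothetical helper for illustration only.
func acceptableFailure(exitCode int32) bool {
	// Exit code 1 is the intended failure; some runtimes report 2 instead.
	return exitCode == 1 || exitCode == 2
}

func main() {
	for _, code := range []int32{0, 1, 2, 137} {
		fmt.Printf("exit code %d acceptable: %v\n", code, acceptableFailure(code))
	}
}
```

Any other code, including 0, is still rejected, so the test keeps asserting that the container actually failed.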

Fixes issue 135713

Tested:
- hack/verify-gofmt.sh
- hack/verify-test-code.sh
- hack/verify-typecheck.sh ./test/e2e/node/...
- go test ./test/e2e/node -run TestNonExistent -count=1

Co-authored-by: Jordan Liggitt <jordan@liggitt.net>
2026-03-21 15:30:46 +01:00
Kubernetes Prow Robot
7a3a6cf4be
Merge pull request #136725 from pravk03/native-dra-2
Introduce support of DRA for Native Resources
2026-03-19 03:36:38 +05:30
Praveen Krishna
cdcfc4eeb3 Add integration tests. 2026-03-18 19:20:10 +00:00
Kubernetes Prow Robot
27b42dd16d
Merge pull request #137453 from rawsocket/master
kubelet: add terminated_containers_total metric
2026-03-18 23:20:49 +05:30
Adel Abouchaev
1a49c37b77 kubelet: add terminated_containers_total metric
Add a new ALPHA stability metric terminated_containers_total to track
  container terminations (both successful and failed). This metric provides
  aggregate visibility into container exit patterns across the node,
  supporting detection of abnormal exits (e.g., SIGSEGV, OOMKilled) and
  enabling error-rate calculations.

  To ensure node stability and comply with Kubernetes instrumentation
  standards, the metric uses the following low-cardinality labels:
   - container_type (container, init_container, or ephemeral_container)
   - exit_code (the literal exit status)
   - reason (the termination reason from the runtime)

  High-cardinality labels (container_name, namespace_name) are deliberately
  omitted to prevent metric cardinality explosion. Problematic containers can
  be identified via standard troubleshooting workflows using Kubernetes
  Events or API status.

  Included:
   - Metric definition and registration in metrics.go.
   - Status manager implementation to record transitions exactly once.
   - Unit tests in status_manager_test.go verifying success/failure logic.
   - Node e2e test to verify correct metrics exposure.
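
The labeling and exactly-once recording described above can be illustrated with a small standard-library sketch. It is not the kubelet implementation: the `terminationCounter` type, its methods, and the use of a container ID to deduplicate are assumptions made for the example. Only the three label names come from the commit message.

```go
package main

import (
	"fmt"
	"sync"
)

// labels holds the three low-cardinality dimensions named in the commit message.
type labels struct {
	ContainerType string // container, init_container, or ephemeral_container
	ExitCode      int32  // the literal exit status
	Reason        string // termination reason from the runtime
}

// terminationCounter is a hypothetical stand-in for the real metric.
type terminationCounter struct {
	mu     sync.Mutex
	counts map[labels]int
	seen   map[string]bool // container IDs already recorded (assumed dedup key)
}

func newTerminationCounter() *terminationCounter {
	return &terminationCounter{counts: map[labels]int{}, seen: map[string]bool{}}
}

// Observe records a termination once per container ID; repeats are ignored.
func (c *terminationCounter) Observe(containerID string, l labels) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.seen[containerID] {
		return
	}
	c.seen[containerID] = true
	c.counts[l]++
}

// Count returns the number of terminations recorded for one label set.
func (c *terminationCounter) Count(l labels) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[l]
}

func main() {
	c := newTerminationCounter()
	segv := labels{ContainerType: "container", ExitCode: 139, Reason: "Error"}
	c.Observe("id-a", segv)
	c.Observe("id-a", segv) // same container reported again: not counted twice
	c.Observe("id-b", segv)
	fmt.Println("terminations recorded:", c.Count(segv))
}
```

Because the container ID is used only for deduplication and never becomes a label, the number of distinct series stays bounded by the label combinations.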
2026-03-18 02:22:29 +00:00
Kubernetes Prow Robot
e1be691e7f
Merge pull request #136043 from natasha41575/os_feasibility
[InPlacePodVerticalScaling] create an admission plugin to perform the OS and node capacity checks
2026-03-18 03:23:39 +05:30
Natasha Sarkar
fd8c6d3e2e add pod resize feasibility check admission plugin 2026-03-17 17:12:31 +00:00
Kubernetes Prow Robot
9c7e57bb7c
Merge pull request #137330 from tico88612/cleanup/test-node-pod-dep-prometheus
Remove dep. Prometheus from test/e2e/node/pods.go
2026-03-16 20:43:49 +05:30
Sergey Kanzhelev
9aee7c917a wait for container condition to be true before sending the pod update 2026-03-13 23:21:22 +00:00
ChengHao Yang
195b9f598d
Remove dep. Prometheus from test/e2e/node/pods.go
Add the MetricFamilyToText in `component-base/metric/testutil`

Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
2026-03-11 19:14:35 +08:00
Yuan Wang
f33a2767aa Refactor container restart policy tests to e2e/common/node
- Added validation for lastTerminationStatus
2026-03-09 23:05:05 +00:00
Mads Jensen
1f2b70a043 Lint: Use modernize/rangeint in test/{e2e,e2e_node,images,soak} 2026-03-07 10:17:31 +01:00
Yuan Wang
906134cee9 Update pod after the container is removed
Ensures the single-container pod can restart quickly
2026-03-05 23:21:33 +00:00
Kubernetes Prow Robot
dd0958fece
Merge pull request #136851 from jiefeng-xu/jiefeng/fix-gpu-flake-136378
test/e2e/node: reduce flakiness in GPU nvidia-smi test
2026-03-04 08:56:17 +05:30
Kevin Hannon
b26954bc0f Merge the pod rejection generation test into pod_admission.go
and comment out PodReadyToStartContainers. Conformance promotion will
follow in a separate PR once this lands green, per review feedback.
2026-03-03 13:58:04 -05:00
Karthik Bhat
1f9a751ec1 Address review comments by using 2 pods instead of 3 pods and simplify the logic. 2026-03-03 11:46:54 +05:30
Jiefeng Xu
b738ae6d97 test/e2e/node: handle quick pod completion in GPU startup wait 2026-03-01 11:50:57 -08:00
Chandan Maurya
e54eef10d1 Use localhost image reference in PodObservedGenerationTracking test
The test uses an invalid image to induce a pull error. The previous image
name 'some-image-that-doesnt-exist' causes slow DNS/registry resolution
on some environments (especially metal), leading to 30s timeouts.

Using 'localhost/some-image-that-does-not-exist' makes the pull fail
instantly since there is no registry on localhost, avoiding flaky
timeouts.
2026-02-26 10:04:00 +05:30
Kubernetes Prow Robot
9dc55d7d9e
Merge pull request #135729 from yangjunmyfm192085/fixe2e2
test/e2e: e2e test cases `should support seccomp default, which is unconfined [LinuxOnly]`. Execution failed.
2026-02-11 09:26:08 +05:30
杨军10092085
d94808665c e2e test cases should support seccomp default, which is unconfined [LinuxOnly]. Execution failed. 2026-02-11 08:17:31 +08:00
Jiefeng Xu
6e203664eb test/e2e/node: reduce flakiness in GPU nvidia-smi test 2026-02-08 22:40:45 -08:00
Karthik Bhat
a3d241347c Resize pod to request more CPU so that it remains in the deferred state 2026-01-29 15:11:24 +05:30
Mads Jensen
757647786d Remove redundant re-assignments in for-loops in test/{e2e,integration,utils}
The modernize forvar rule was applied. There are more details in this blog
post: https://go.dev/blog/loopvar-preview
2026-01-25 22:58:27 +01:00
Sotiris Salloumis
d9c3ec29ad Move getNodeAllocatableAndAvailableValues to framework
This allows future tests that use the e2enode test framework
to reuse this method.
2026-01-21 19:41:08 +01:00
Patrick Ohly
47d02070ba E2E: remove unnecessary trailing spaces in test names
The spaces are unnecessary because Ginkgo adds spaces automatically.

Previously this was detected only for tests using the wrapper functions;
now it is also detected for Ginkgo methods.
2026-01-07 12:05:43 +01:00
ndixita
10b73f8ef9
Test fixes
Signed-off-by: ndixita <ndixita@google.com>
2025-11-12 06:21:06 +00:00
ndixita
1733d8fc8c
e2e tests
Signed-off-by: ndixita <ndixita@google.com>
2025-11-11 18:19:09 +00:00
ndixita
efc3126b76
Adding Resources and AllocatedResources fields to the list of expected fields in PodStatus in admission test 2025-11-11 18:15:20 +00:00
Yuan Wang
0b47a37861 Keep pod in running state and prune past container status from runtime 2025-11-11 06:37:49 +00:00
Yuan Wang
aac951d902 Add dependency for NodeDeclaredFeatures 2025-11-10 09:41:02 +00:00
Yuan Wang
2eb1eeeabf add disruptive tests 2025-11-10 09:41:02 +00:00
Yuan Wang
83c5cd5526 Implement restartPod action 2025-11-10 09:41:02 +00:00
Lubomir I. Ivanov
396a7c1a12 test/e2e/node: add minimum kubelet version to some pod tests
A couple of tests were recently promoted to conformance
but they did not include a minimum kubelet version,
which broke the kubeadm/kinder e2e jobs that skew the kubelet
version against the apiserver version.
2025-11-05 12:06:47 +02:00
Natasha Sarkar
2a217a9bfd promote pod generation tests to conformance 2025-10-29 20:57:59 +00:00
Natasha Sarkar
21c832b47d promote pod generation to GA 2025-10-29 15:52:17 +00:00
Kubernetes Prow Robot
c7f910ed1f
Merge pull request #133762 from natasha41575/expandQuotaTests
[InPlacePodVerticalScaling] Expand coverage for resourceQuota and limitRanger e2e tests
2025-10-02 00:10:56 -07:00
Michael Aspinwall
84f85712be feat: Add matcher and conformance tests ensuring that RV is uint128 2025-10-01 00:01:50 +00:00
Michael Aspinwall
37fcfcd29e feat: Add conformance tests for all resources for comparable resource version 2025-09-29 23:32:07 +00:00
Natasha Sarkar
89b75e998d expand coverage for resource quota and limit ranger tests 2025-09-19 15:44:42 +00:00
Mauricio Poppe
55700685bd
Revert "Add retries to node's crictl test" 2025-09-08 20:35:31 -04:00
Sascha Grunert
c8f8f66e6d
Increase termination timeout for evicted pods should be terminal test
This doubles the termination timeout for the eviction test from 5min to
10min. The eviction manager relies on pod stats metrics, which may be
inaccessible for a period of time while the kubelet API is unreachable.
This can be caused by hardware or network pressure when multiple tests
run in parallel.

Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
2025-09-03 08:58:46 +02:00
Natasha Sarkar
f1d980adf9 separate resource-quota and limit-ranger resize tests 2025-08-28 15:56:10 +00:00