The fast-delete pod status tests currently require the intentionally failing
"fail" container to report exit code 1. In CI, some runtimes occasionally
report exit code 2 with reason=Error even though the tested invariant still
holds: the container failed and the blocked workload container never started.
The latest dims/test-k8s failure on master showed exactly that state: the pod
remained Failed, Initialized=False, the blocked container reported
started=false, and only the failing init container drifted from exit 1 to exit
2. This matches kubernetes/kubernetes issue 135713 and the related
pending-container history in PR 131605.
Accept exit code 2 in this verifier so the test continues to assert the
behavior it is meant to cover instead of a lower-layer exit-code detail.
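A minimal sketch of the relaxed check, for illustration only. The type
terminatedState, the set acceptedFailureExitCodes, and the function
verifyFailedContainer are hypothetical names, not the identifiers used in the
e2e test; the reason check is an assumption based on the reason=Error
observation above.

```go
package main

import "fmt"

// terminatedState is a simplified, hypothetical stand-in for the terminated
// state of a container status.
type terminatedState struct {
	ExitCode int32
	Reason   string
}

// acceptedFailureExitCodes lists the exit codes treated as "the container
// failed as intended": 1 is the expected code, 2 is the drift seen in CI.
var acceptedFailureExitCodes = map[int32]bool{1: true, 2: true}

// verifyFailedContainer returns nil when the state describes an accepted
// failure and an error otherwise.
func verifyFailedContainer(t terminatedState) error {
	if t.Reason != "Error" {
		return fmt.Errorf("unexpected reason %q, want %q", t.Reason, "Error")
	}
	if !acceptedFailureExitCodes[t.ExitCode] {
		return fmt.Errorf("unexpected exit code %d, want 1 or 2", t.ExitCode)
	}
	return nil
}

func main() {
	for _, code := range []int32{0, 1, 2, 137} {
		err := verifyFailedContainer(terminatedState{ExitCode: code, Reason: "Error"})
		fmt.Printf("exit code %d: err=%v\n", code, err)
	}
}
```

Keeping the accepted codes in a small set leaves the pod-level assertions
(Failed phase, blocked container never started) untouched.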
Fixes issue 135713
Tested:
- hack/verify-gofmt.sh
- hack/verify-test-code.sh
- hack/verify-typecheck.sh ./test/e2e/node/...
- go test ./test/e2e/node -run TestNonExistent -count=1
Co-authored-by: Jordan Liggitt <jordan@liggitt.net>
Add a new ALPHA stability metric terminated_containers_total to track
container terminations (both successful and failed). This metric provides
aggregate visibility into container exit patterns across the node,
supporting detection of abnormal exits (e.g., SIGSEGV, OOMKilled) and
enabling error-rate calculations.
To ensure node stability and comply with Kubernetes instrumentation
standards, the metric uses the following low-cardinality labels:
- container_type (container, init_container, or ephemeral_container)
- exit_code (the literal exit status)
- reason (the termination reason from the runtime)
High-cardinality labels (container_name, namespace_name) are deliberately
omitted to prevent metric cardinality explosion. Problematic containers can
be identified via standard troubleshooting workflows using Kubernetes
Events or API status.
Included:
- Metric definition and registration in metrics.go.
- Status manager implementation to record transitions exactly once.
- Unit tests in status_manager_test.go verifying success/failure logic.
- Node e2e test to verify correct metrics exposure.
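The label scheme and the exactly-once recording can be sketched with the
standard library alone. This is not the kubelet implementation: the type
terminationCounter, its methods, and the use of a container ID to
deduplicate are assumptions made for illustration, and the real metric is
registered through the kubelet metrics package rather than a map.

```go
package main

import (
	"fmt"
	"strconv"
	"sync"
)

// terminationLabels mirrors the three low-cardinality labels named above.
type terminationLabels struct {
	ContainerType string // container, init_container, or ephemeral_container
	ExitCode      string // literal exit status, rendered as a string
	Reason        string // termination reason from the runtime
}

// terminationCounter is a hypothetical in-memory stand-in for the counter.
type terminationCounter struct {
	mu     sync.Mutex
	counts map[terminationLabels]int
	seen   map[string]bool // container IDs already counted (assumed dedup key)
}

func newTerminationCounter() *terminationCounter {
	return &terminationCounter{
		counts: map[terminationLabels]int{},
		seen:   map[string]bool{},
	}
}

// Record counts a termination once per container ID and reports whether the
// call incremented the counter.
func (c *terminationCounter) Record(containerID, containerType string, exitCode int32, reason string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.seen[containerID] {
		return false
	}
	c.seen[containerID] = true
	key := terminationLabels{
		ContainerType: containerType,
		ExitCode:      strconv.Itoa(int(exitCode)),
		Reason:        reason,
	}
	c.counts[key]++
	return true
}

// Count returns the current value for one label combination.
func (c *terminationCounter) Count(containerType string, exitCode int32, reason string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[terminationLabels{
		ContainerType: containerType,
		ExitCode:      strconv.Itoa(int(exitCode)),
		Reason:        reason,
	}]
}

func main() {
	c := newTerminationCounter()
	c.Record("id-a", "container", 137, "OOMKilled")
	c.Record("id-a", "container", 137, "OOMKilled") // repeated status sync, ignored
	c.Record("id-b", "init_container", 0, "Completed")
	fmt.Println(c.Count("container", 137, "OOMKilled"))
}
```

Because no container or namespace name appears in the key, the number of
series is bounded by the distinct (type, exit code, reason) combinations.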
generation test into pod_admission.go and commenting out
PodReadyToStartContainers. Conformance promotion will follow in a
separate PR once this lands green, per review feedback.
The test uses an invalid image to induce a pull error. The previous image
name 'some-image-that-doesnt-exist' causes slow DNS/registry resolution
on some environments (especially metal), leading to 30s timeouts.
Using 'localhost/some-image-that-does-not-exist' makes the pull fail
instantly since there is no registry on localhost, avoiding flaky
timeouts.
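One way to see the difference between the two names is to look at which
registry each reference points to. The sketch below uses a simplified form of
the common image-reference rule (the first path component is a registry host
only if it contains '.' or ':' or equals "localhost"). The function
registryHost is hypothetical, and docker.io as the fallback is an assumption;
the actual default depends on the container runtime configuration.

```go
package main

import (
	"fmt"
	"strings"
)

// defaultRegistry is an assumed fallback for unqualified image names.
const defaultRegistry = "docker.io"

// registryHost returns the registry host an image reference would be pulled
// from, using a simplified version of the usual reference-parsing rule.
func registryHost(image string) string {
	i := strings.IndexRune(image, '/')
	if i == -1 {
		// No slash at all: the whole string is a repository name.
		return defaultRegistry
	}
	first := image[:i]
	if strings.ContainsAny(first, ".:") || first == "localhost" {
		return first
	}
	return defaultRegistry
}

func main() {
	for _, image := range []string{
		"some-image-that-doesnt-exist",
		"localhost/some-image-that-does-not-exist",
	} {
		fmt.Printf("%s -> %s\n", image, registryHost(image))
	}
}
```

Under these assumptions the old name needs a remote registry lookup, while
the new one targets localhost, where the connection is refused immediately.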
The spaces are unnecessary because Ginkgo adds spaces automatically.
Previously this was detected only for tests using the wrapper functions;
now it is also detected for Ginkgo methods.
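The kind of check involved can be sketched as follows. The function
checkTestText is a hypothetical illustration, not the actual verifier, and it
assumes "spaces" means leading or trailing whitespace in the test text.

```go
package main

import (
	"fmt"
	"strings"
)

// checkTestText rejects test text with leading or trailing whitespace, since
// Ginkgo already joins the text of nested nodes with spaces.
func checkTestText(text string) error {
	if strings.TrimSpace(text) != text {
		return fmt.Errorf("test text %q must not start or end with whitespace", text)
	}
	return nil
}

func main() {
	for _, text := range []string{"should evict pods", " should evict pods", "should evict pods "} {
		fmt.Printf("%q: err=%v\n", text, checkTestText(text))
	}
}
```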
A couple of tests were recently promoted to conformance
but they did not include a minimum kubelet version,
which broke the kubeadm/kinder e2e jobs that skew the kubelet
version against the apiserver version.
This doubles the termination timeout for the eviction test from 5min to
10min. The eviction manager relies on pod stats metrics, which may be
inaccessible for a period of time when the kubelet API is unreachable.
This can be caused by hardware or network pressure when multiple tests
run in parallel.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>