Commit graph

5591 commits

Author SHA1 Message Date
Nikhita Raghunath
4dd99967bd pkg/controller/job: re-honor exponential backoff
This commit makes the job controller re-honor exponential backoff for
failed pods. Before this commit, the controller created pods without any
backoff. This is a regression because the controller previously
created pods with an exponential backoff delay (10s, 20s, 40s ...).

The issue occurs only when the JobTrackingWithFinalizers feature is
enabled (which is enabled by default right now). With this feature, we
get an extra pod update event when the finalizer of a failed pod is
removed.

Note that the pod failure detection and new pod creation happen in the
same reconcile loop so the 2nd pod is created immediately after the 1st
pod fails. The backoff is only applied on the 2nd pod failure, which means
that the 3rd pod is created 10s after the 2nd pod, the 4th pod is created
20s after the 3rd pod, and so on.

This commit fixes a few bugs:

1. Right now, each time `uncounted != nil` and the job does not see a
_new_ failure, `forget` is set to true and the job is removed from the
queue. This condition is also triggered each time the finalizer for a
failed pod is removed, so `NumRequeues` is reset, which results in a
backoff of 0s.

2. Updates `updatePod` to only apply backoff when we see a particular
pod failed for the first time. This is necessary to ensure that the
controller does not apply backoff when it sees a pod update event
for finalizer removal of a failed pod.

3. If `JobsReadyPods` feature is enabled and backoff is 0s, the job is
now enqueued after `podUpdateBatchPeriod` seconds, instead of 0s. The
unit test for this check also had a few bugs:
    - `DefaultJobBackOff` is overwritten to 0 in certain unit tests,
    which meant that `DefaultJobBackOff` was considered to be 0,
    effectively not running any meaningful checks.
    - `JobsReadyPods` was not enabled for test cases that ran tests
    which required the feature gate to be enabled.
    - The check for expected and actual backoff had incorrect
    calculations.
2023-01-12 20:52:53 +05:30
Kubernetes Prow Robot
175142d771
Merge pull request #112084 from gjkim42/automated-cherry-pick-of-#109694-upstream-release-1.23
Automated cherry pick of #109694: Be sure to update the status of StatefulSet even if the new
2023-01-11 14:00:07 -08:00
Jordan Liggitt
203d8ac838
Generate and format files
- Run hack/update-codegen.sh
- Run hack/update-generated-device-plugin.sh
- Run hack/update-generated-protobuf.sh
- Run hack/update-generated-runtime.sh
- Run hack/update-generated-swagger-docs.sh
- Run hack/update-openapi-spec.sh
- Run hack/update-gofmt.sh

Replay of a9593d634c
2022-12-20 17:26:07 -05:00
Abirdcfly
ab0f90f3d2
Update golangci-lint to 1.46.2 and fix errors
Cherry-pick of 2bca77a3d9

Signed-off-by: Abirdcfly <fp544037857@gmail.com>
2022-12-20 17:26:02 -05:00
Gunju Kim
073dca06ae
Fix a conflict 2022-11-08 21:07:08 +09:00
Gunju Kim
65beed7952
StatefulSet: Cleanup the complex defer function updating the status
In the long term, the complex defer function makes the code harder to
maintain, because any code that follows it must take it into account.
This removes the complex defer function that updates the status of a
StatefulSet.
2022-11-08 20:48:26 +09:00
Aohan Yang
d737324312
Be sure to update the status of StatefulSet even if the new replica creation fails 2022-11-08 20:48:26 +09:00
Jakub Przychodzeń
6e3601cc72 NodeLifecycleController: Remove race condition
Patch requests do not include the resource version (RV) by default, so we need to include it explicitly. Patching a list also overwrites the whole field. Together this creates a race condition in which we can overwrite changes to taints that happened between the GET and PATCH requests.
2022-10-25 14:09:02 +00:00
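The idea behind 6e3601cc72 can be sketched as a patch body that carries the resource version read during the GET, so that a patch computed from stale data can be rejected instead of silently overwriting newer taints. This is a simplified illustration using only `encoding/json`; `taint` and `buildTaintPatch` are hypothetical names, not the controller's actual types or helpers.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// taint is a simplified stand-in for the node taint type.
type taint struct {
	Key    string `json:"key"`
	Effect string `json:"effect"`
}

// buildTaintPatch (hypothetical) builds a patch body that replaces the whole
// spec.taints list and pins metadata.resourceVersion to the version that was
// read before the new taint list was computed.
func buildTaintPatch(resourceVersion string, taints []taint) ([]byte, error) {
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{"resourceVersion": resourceVersion},
		"spec":     map[string]interface{}{"taints": taints},
	}
	return json.Marshal(patch)
}

func main() {
	body, err := buildTaintPatch("42", []taint{
		{Key: "node.kubernetes.io/unreachable", Effect: "NoExecute"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

Without the `metadata.resourceVersion` field, the same body would apply regardless of what changed on the node in the meantime, which is the race the commit message describes.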
xing-yang
479f049df9 Fix unit test 2022-09-07 22:44:17 -04:00
ZhangKe10140699
62e1ea58c4 Fix problem in updating VolumeAttached in node status 2022-09-07 18:21:39 -04:00
Aldo Culquicondor
23e9d632ad Fix deleting UIDs tracking expectations
Change-Id: I5dad644cf5cb232ebed0950a14b35a781a38eeb0
2022-08-05 16:06:34 -04:00
Aldo Culquicondor
24b8252b10 Fix JobTrackingWithFinalizers when a pod succeeds after the job fails
Change-Id: I3be351fb3b53216948a37b1d58224f8fbbf22b47
2022-08-03 11:59:13 -04:00
Aldo Culquicondor
71a4c69a21 Do not skip job requeue in conflict error
Change-Id: Ie97977887a1cc3de58922d73dce92ae1965965bf
2022-07-08 20:20:04 +00:00
Harsha Narayana
3d1baf7ff2
GIT-110239: fix activeDeadlineSeconds enforcement bug
GIT-110239: add additional tests with preset Status.StartTime

GIT-110294: cherry pick changes from #110771
2022-06-24 20:13:46 +05:30
Kubernetes Prow Robot
dde7b34269
Merge pull request #108879 from robscott/automated-cherry-pick-of-#108078-upstream-release-1.23
Automated cherry pick of #108078: Skip updating Endpoints and EndpointSlice if no relevant
2022-06-12 05:06:09 -07:00
Sanskar Jaiswal
6041228d19 move the ignore logic higher up to the reconciler
Signed-off-by: Sanskar Jaiswal <jaiswalsanskar078@gmail.com>
2022-06-09 13:07:23 +05:30
Sanskar Jaiswal
0e1588c758 Ignore EndpointSlices that are already marked for deletion
Signed-off-by: Sanskar Jaiswal <jaiswalsanskar078@gmail.com>
2022-06-09 13:07:23 +05:30
Rob Scott
67219f3045
Updating e2e test to check EndpointSlices and Endpoints as well 2022-05-27 23:40:10 +00:00
Antonio Ojea
e0fdecef81
endpoints controller: don't consider terminal endpoints
Terminal pods, whose phase is Failed or Succeeded, are guaranteed
to never regress and to be stopped, so their IPs should never
be published on the Endpoints.
2022-05-27 23:39:37 +00:00
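The rule stated in e0fdecef81 (terminal pods must not contribute addresses) can be sketched as a simple filter. The types below are simplified stand-ins for the real API types, and `isTerminal` and `publishableIPs` are hypothetical helper names used only for this example.

```go
package main

import "fmt"

// podPhase and pod are simplified stand-ins for the real API types.
type podPhase string

const (
	phaseRunning   podPhase = "Running"
	phaseSucceeded podPhase = "Succeeded"
	phaseFailed    podPhase = "Failed"
)

type pod struct {
	Name  string
	IP    string
	Phase podPhase
}

// isTerminal reports whether the pod has reached a phase it can never leave.
func isTerminal(p pod) bool {
	return p.Phase == phaseFailed || p.Phase == phaseSucceeded
}

// publishableIPs returns the IPs of pods that may appear in Endpoints:
// terminal pods and pods without an IP are skipped.
func publishableIPs(pods []pod) []string {
	var ips []string
	for _, p := range pods {
		if isTerminal(p) || p.IP == "" {
			continue
		}
		ips = append(ips, p.IP)
	}
	return ips
}

func main() {
	pods := []pod{
		{Name: "web-0", IP: "10.0.0.1", Phase: phaseRunning},
		{Name: "job-1", IP: "10.0.0.2", Phase: phaseSucceeded},
		{Name: "job-2", IP: "10.0.0.3", Phase: phaseFailed},
	}
	fmt.Println(publishableIPs(pods))
}
```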
Antonio Ojea
4e9638063f
endpointslices: terminal pods don't receive endpoints 2022-05-27 23:30:00 +00:00
Aldo Culquicondor
9b4dee8927 Wait for cache to sync in job's TestWatchOrphanPods
Otherwise the event handler might not be called.

Change-Id: I23c93c2251b411430a0f2469686db6355d84af2f
2022-05-11 13:07:24 -04:00
Aldo Culquicondor
11e6ec4396 Fix removing finalizer from finished jobs
In some rare race conditions, the job controller might create new pods after the job is declared finished.

Change-Id: I8a00429c8845463259cd7f82bb3c241d0011583c
2022-05-04 11:09:04 -04:00
Aldo Culquicondor
b680431336 Don't mark job as failed until expectations are satisfied
Change-Id: I99206f35f6f145054c005ab362c792e71b9b15f4
2022-05-04 11:09:04 -04:00
Aldo Culquicondor
f75e1b071d Remove finalizer when orphaned
Change-Id: Id88a28755660812a274dffab2693cb8a0ef4235c
2022-03-25 11:23:25 -04:00
Aldo Culquicondor
56d9c45895 Fix: Clean job tracking finalizer from orphan pods
Change-Id: I04cd70725fd1830be8daf2dca53f67bc10a379b7
2022-03-25 10:46:59 -04:00
Quan Tian
02f2986b85
Skip updating Endpoints and EndpointSlice if no relevant fields change
When comparing EndpointSubsets and Endpoints, we ignore the difference
in ResourceVersion of Pod to avoid unnecessary updates caused by Pod
updates that we don't care about, e.g. an annotation update.

Otherwise, a periodic Service resync would repeatedly update every
Endpoints or EndpointSlice whose Pods had an irrelevant change between
two resyncs, delaying the processing of newly created Services. In a
large cluster with thousands of such Endpoints, we observed a 2-minute
delay when the resync happens.
2022-03-22 10:04:29 -07:00
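The comparison described in 02f2986b85 can be sketched as an equality check that deliberately skips the pod's resource version. The `endpointAddress` type and `equalIgnoringResourceVersion` are hypothetical, simplified for illustration, and the sketch assumes both slices are already in the same order.

```go
package main

import "fmt"

// endpointAddress is a simplified stand-in for an endpoint address together
// with a reference to the pod that backs it.
type endpointAddress struct {
	IP                    string
	TargetName            string
	TargetResourceVersion string // changes on every pod update, relevant or not
}

// equalIgnoringResourceVersion compares two address lists field by field but
// skips TargetResourceVersion, so an irrelevant pod update (for example an
// annotation change) does not make the lists look different.
func equalIgnoringResourceVersion(a, b []endpointAddress) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i].IP != b[i].IP || a[i].TargetName != b[i].TargetName {
			return false
		}
	}
	return true
}

func main() {
	current := []endpointAddress{{IP: "10.0.0.1", TargetName: "web-0", TargetResourceVersion: "100"}}
	desired := []endpointAddress{{IP: "10.0.0.1", TargetName: "web-0", TargetResourceVersion: "101"}}
	if equalIgnoringResourceVersion(current, desired) {
		fmt.Println("no relevant change, skipping update")
	}
}
```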
Kubernetes Prow Robot
6e2d8d549c
Merge pull request #108306 from simonjpartridge/automated-cherry-pick-of-#107997-upstream-release-1.23
Automated cherry pick of #107997: cronjob_controllerv2: do not filter jobs to be reconciled by
2022-03-03 05:16:03 -08:00
Jean-Francois Remy
56bfc202e4 Add unit tests
- actual_state_of_world_test.go: test the new method GetVolumesToReportAttachedForNode
  for an existing node and a non-existing node
- node_status_updater_test.go: test UpdateNodeStatuses and UpdateNodeStatusForNode in nominal
  case with 2 nodes getting one volume each. Test UpdateNodeStatuses with the first call
  to node.patch failing but the following one succeeding
- add comment in node_status_updater.go
- fix log line in reconciler.go
- rename variable in actual_state_of_world.go
2022-03-02 10:48:18 -08:00
Jean-Francois Remy
a5faf0b5ce Fix nodes volumesAttached status not updated
The UpdateNodeStatuses code stops too early in case there is
an error when calling updateNodeStatus. It will return immediately
which means any remaining node won't have its update status put back
to true.

Looking at the call sites for UpdateNodeStatuses, it appears this is
not the only issue. If the lister call fails with anything but a Not Found
error, it's silently ignored which is wrong in the detach path.
Also, the reconciler detach path calls UpdateNodeStatuses, but the real intent
is to update only the node currently processed in the loop and not proceed
with the detach call if there is an error updating that specific node's volumesAttached
property. With the current implementation, it will not proceed if there is
an error updating another node (which is not completely bad but not ideal) and,
worse, it will proceed if there is a lister error on that node, which means the
node's volumesAttached property won't have been updated.

To fix those issues, introduce the following changes:
- [node_status_updater] introduce UpdateNodeStatusForNode which does what
  UpdateNodeStatuses does but only for the provided node
- [node_status_updater] if the node lister call fails for anything but a Not
  Found error, we will return an error, not ignore it
- [node_status_updater] if the update of a node volumesAttached properties fails
  we continue processing the other nodes
- [actual_state_of_world] introduce GetVolumesToReportAttachedForNode which
  does what GetVolumesToReportAttached does, but only for the node whose name
  is provided. It returns a bool which indicates if the node in question needs
  an update, as well as the volumesAttached list. It is used by UpdateNodeStatusForNode
- [actual_state_of_world] use write lock in updateNodeStatusUpdateNeeded, we're
  modifying the map content
- [reconciler] use UpdateNodeStatusForNode in the detach loop
2022-03-02 10:47:28 -08:00
d-honeybadger
621894de9d cronjob_controllerv2: do not filter jobs to be reconciled by labels 2022-02-23 16:19:13 +00:00
Hemant Kumar
f61c4b18c4 use node informer to check volumes attachment status before backoff
fix unit tests
2022-01-06 12:15:56 -05:00
Rob Scott
271a9f0e58
Improving performance of EndpointSlice controller metrics cache 2021-12-21 08:45:31 -08:00
Aohan Yang
98cc4f9e96 fix the error when cleaning up jobs for cronjob 2021-12-20 08:45:10 -05:00
Matthew Cary
0e2b901762 Clean up deep copy needed for UpdateStatefulSet
Change-Id: Id732358183d682d1a945cfee56f83bcaac0d7c31
2021-11-23 06:48:54 -08:00
Kubernetes Prow Robot
084b28f6d5
Merge pull request #106510 from robscott/topology-ready-fix-controller
Updating TopologyCache to disregard unready endpoints in calculations
2021-11-19 17:07:11 -08:00
Rob Scott
9813ec7e8a
Updating TopologyCache to disregard unready endpoints in calculations 2021-11-18 13:54:09 -08:00
Matthew Cary
53b3a6c1d9 controller change for statefulset auto-delete (tests)
Change-Id: I16b50e6853bba65fc89c793d2b9b335581c02407
2021-11-17 16:48:50 -08:00
Matthew Cary
bce87a3e4f controller change for statefulset auto-delete (implementation) 2021-11-17 16:48:50 -08:00
Kubernetes Prow Robot
39c76ba2ed
Merge pull request #106455 from soltysh/cronjob_warning
Add warning about using unsupported CRON_TZ
2021-11-16 17:44:31 -08:00
Kubernetes Prow Robot
165b581759
Merge pull request #105623 from ash2k/ash2k/resettable-rest-mapper
ResettableRESTMapper to make it possible to reset wrapped mappers
2021-11-16 16:36:08 -08:00
Kubernetes Prow Robot
6805e6ee41
Merge pull request #104722 from leiyiz/migration
turning on the CSIMigrationGCE feature flag
2021-11-16 15:28:32 -08:00
Kubernetes Prow Robot
f151a40d8d
Merge pull request #106154 from gnufied/recover-expansion-failure-123
Recover expansion failure
2021-11-16 13:21:34 -08:00
Léiyì Zhang
275fdf0884 fixing unit test failures induced by turning on CSIMigrationGCE
disable CSIMigrationGCE in some unit tests
2021-11-16 19:26:30 +00:00
Maciej Szulik
d0518848b5
Add warning about using unsupported CRON_TZ
The CRON_TZ variable slipped in while upgrading the github.com/robfig/cron
library. It allows setting a time zone, which is a long-requested
feature but one that is not officially supported. This adds a warning
event, since users should not rely on unsupported features.
2021-11-16 17:41:37 +01:00
Hemant Kumar
1ddd598d31 Implement controller and kubelet changes for recovery from resize failures
2021-11-16 11:06:46 -05:00
Kubernetes Prow Robot
6d1d8c73ee
Merge pull request #106316 from josephburnett/controller-v2
Watch HPA v2 instead of v1.
2021-11-16 06:41:38 -08:00
Joseph Burnett
711f96e05e Watch HPA v2 instead of v1. 2021-11-16 11:13:21 +01:00
Kubernetes Prow Robot
ce98eda406
Merge pull request #106376 from jsafrane/stabilize-unit-test
Fix deletion protection unit test
2021-11-15 13:04:48 -08:00
Neha Lohia
fa1b6765d5
move pkg/util/node to component-helpers/node/util (#105347)
Signed-off-by: Neha Lohia <nehapithadiya444@gmail.com>
2021-11-12 07:52:27 -08:00
Jan Safranek
bb8157d780 Fix deletion protection unit test
The test should not depend on the current set of default feature gates; it
should always ensure that the ones necessary for the tests are set.
2021-11-12 10:47:15 +01:00