kubernetes

mirror of https://github.com/kubernetes/kubernetes.git synced 2026-06-09 00:34:10 -04:00

Author	SHA1	Message	Date
Nikhita Raghunath	4dd99967bd	pkg/controller/job: re-honor exponential backoff This commit makes the job controller re-honor exponential backoff for failed pods. Before this commit, the controller created pods without any backoff. This is a regression because the controller used to create pods with an exponential backoff delay before (10s, 20s, 40s ...). The issue occurs only when the JobTrackingWithFinalizers feature is enabled (which is enabled by default right now). With this feature, we get an extra pod update event when the finalizer of a failed pod is removed. Note that the pod failure detection and new pod creation happen in the same reconcile loop so the 2nd pod is created immediately after the 1st pod fails. The backoff is only applied on 2nd pod failure, which means that the 3rd pod is created 10s after the 2nd pod, the 4th pod is created 20s after the 3rd pod and so on. This commit fixes a few bugs: 1. Right now, each time `uncounted != nil` and the job does not see a _new_ failure, `forget` is set to true and the job is removed from the queue. Which means that this condition is also triggered each time the finalizer for a failed pod is removed and `NumRequeues` is reset, which results in a backoff of 0s. 2. Updates `updatePod` to only apply backoff when we see a particular pod failed for the first time. This is necessary to ensure that the controller does not apply backoff when it sees a pod update event for finalizer removal of a failed pod. 3. If `JobsReadyPods` feature is enabled and backoff is 0s, the job is now enqueued after `podUpdateBatchPeriod` seconds, instead of 0s. The unit test for this check also had a few bugs: - `DefaultJobBackOff` is overwritten to 0 in certain unit tests, which meant that `DefaultJobBackOff` was considered to be 0, effectively not running any meaningful checks. - `JobsReadyPods` was not enabled for test cases that ran tests which required the feature gate to be enabled. - The check for expected and actual backoff had incorrect calculations.	2023-01-12 20:52:53 +05:30
Kubernetes Prow Robot	175142d771	Merge pull request #112084 from gjkim42/automated-cherry-pick-of-#109694-upstream-release-1.23 Automated cherry pick of #109694: Be sure to update the status of StatefulSet even if the new	2023-01-11 14:00:07 -08:00
Jordan Liggitt	203d8ac838	Generate and format files - Run hack/update-codegen.sh - Run hack/update-generated-device-plugin.sh - Run hack/update-generated-protobuf.sh - Run hack/update-generated-runtime.sh - Run hack/update-generated-swagger-docs.sh - Run hack/update-openapi-spec.sh - Run hack/update-gofmt.sh Replay of `a9593d634c`	2022-12-20 17:26:07 -05:00
Abirdcfly	ab0f90f3d2	Update golangci-lint to 1.46.2 and fix errors Cherry-pick of `2bca77a3d9` Signed-off-by: Abirdcfly <fp544037857@gmail.com>	2022-12-20 17:26:02 -05:00
Gunju Kim	073dca06ae	Fix a conflict	2022-11-08 21:07:08 +09:00
Gunju Kim	65beed7952	StatefulSet: Cleanup the complex defer function updating the status In the long term, the complex defer function makes the code harder to maintain as code after it should take that into account. This removes the complex defer function updating the status of a statefulset.	2022-11-08 20:48:26 +09:00
Aohan Yang	d737324312	Be sure to update the status of StatefulSet even if the new replica creation fails	2022-11-08 20:48:26 +09:00
Jakub Przychodzeń	6e3601cc72	NodeLifecycleController: Remove race condition Patch request does not support RV by default, we need to include them explicitly and patching lists actually overwrites whole field. It means that there is a race condition, in which we can overwrite changes to taints that happened between GET and PATCH requests.	2022-10-25 14:09:02 +00:00
xing-yang	479f049df9	Fix unit test	2022-09-07 22:44:17 -04:00
ZhangKe10140699	62e1ea58c4	Fix problem in updating VolumeAttached in node status	2022-09-07 18:21:39 -04:00
Aldo Culquicondor	23e9d632ad	Fix deleting UIDs tracking expectations Change-Id: I5dad644cf5cb232ebed0950a14b35a781a38eeb0	2022-08-05 16:06:34 -04:00
Aldo Culquicondor	24b8252b10	Fix JobTrackingWithFinalizers when a pod succeeds after the job fails Change-Id: I3be351fb3b53216948a37b1d58224f8fbbf22b47	2022-08-03 11:59:13 -04:00
Aldo Culquicondor	71a4c69a21	Do not skip job requeue in conflict error Change-Id: Ie97977887a1cc3de58922d73dce92ae1965965bf	2022-07-08 20:20:04 +00:00
Harsha Narayana	3d1baf7ff2	GIT-110239: fix activeDeadlineSeconds enforcement bug GIT-110239: add additional tests with preset Status.StartTime GIT-110239: add additional tests with preset Status.StartTime GIT-110294: cherry pick changes from #110771	2022-06-24 20:13:46 +05:30
Kubernetes Prow Robot	dde7b34269	Merge pull request #108879 from robscott/automated-cherry-pick-of-#108078-upstream-release-1.23 Automated cherry pick of #108078: Skip updating Endpoints and EndpointSlice if no relevant	2022-06-12 05:06:09 -07:00
Sanskar Jaiswal	6041228d19	move the ignore logic higher up to the reconciler Signed-off-by: Sanskar Jaiswal <jaiswalsanskar078@gmail.com>	2022-06-09 13:07:23 +05:30
Sanskar Jaiswal	0e1588c758	Ignore EndpointSlices that are already marked for deletion Signed-off-by: Sanskar Jaiswal <jaiswalsanskar078@gmail.com>	2022-06-09 13:07:23 +05:30
Rob Scott	67219f3045	Updating e2e test to check EndpointSlices and Endpoints as well	2022-05-27 23:40:10 +00:00
Antonio Ojea	e0fdecef81	endpoints controller: don't consider terminal endpoints Terminal pods, whose phase is Failed or Succeeded, are guaranteed to never regress and to be stopped, so their IPs should never be published on the Endpoints.	2022-05-27 23:39:37 +00:00
Antonio Ojea	4e9638063f	endpointslices: terminal pods don't receive endpoints	2022-05-27 23:30:00 +00:00
Aldo Culquicondor	9b4dee8927	Wait for cache to sync in job's TestWatchOrphanPods Otherwise the event handler might not be called. Change-Id: I23c93c2251b411430a0f2469686db6355d84af2f	2022-05-11 13:07:24 -04:00
Aldo Culquicondor	11e6ec4396	Fix removing finalizer from finished jobs In some rare race conditions, the job controller might create new pods after the job is declared finished. Change-Id: I8a00429c8845463259cd7f82bb3c241d0011583c	2022-05-04 11:09:04 -04:00
Aldo Culquicondor	b680431336	Don't mark job as failed until expectations are satisfied Change-Id: I99206f35f6f145054c005ab362c792e71b9b15f4	2022-05-04 11:09:04 -04:00
Aldo Culquicondor	f75e1b071d	Remove finalizer when orphaned Change-Id: Id88a28755660812a274dffab2693cb8a0ef4235c	2022-03-25 11:23:25 -04:00
Aldo Culquicondor	56d9c45895	Fix: Clean job tracking finalizer from orphan pods Change-Id: I04cd70725fd1830be8daf2dca53f67bc10a379b7	2022-03-25 10:46:59 -04:00
Quan Tian	02f2986b85	Skip updating Endpoints and EndpointSlice if no relevant fields change When comparing EndpointSubsets and Endpoints, we ignore the difference in ResourceVersion of Pod to avoid unnecessary updates caused by Pod updates that we don't care, e.g. annotation update. Otherwise periodic Service resync would intensively update Endpoints or EndpointSlice whose Pods have irrelevant change between two resyncs, leading to delay in processing newly created Services. In a scale cluster with thousands of such Endpoints, we observed 2 minutes of delay when the resync happens.	2022-03-22 10:04:29 -07:00
Kubernetes Prow Robot	6e2d8d549c	Merge pull request #108306 from simonjpartridge/automated-cherry-pick-of-#107997-upstream-release-1.23 Automated cherry pick of #107997: cronjob_controllerv2: do not filter jobs to be reconciled by	2022-03-03 05:16:03 -08:00
Jean-Francois Remy	56bfc202e4	Add unit tests - actual_state_of_world_test.go: test the new method GetVolumesToReportAttachedForNode for an existing node and a non-existing node - node_status_updater_test.go: test UpdateNodeStatuses and UpdateNodeStatuses in nominal case with 2 nodes getting one volume each. Test UpdateNodeStatuses with the first call to node.patch failing but the following one succeeding - add comment in node_status_updater.go - fix log line in reconciler.go - rename variable in actual_state_of_world.go	2022-03-02 10:48:18 -08:00
Jean-Francois Remy	a5faf0b5ce	Fix nodes volumesAttached status not updated The UpdateNodeStatuses code stops too early in case there is an error when calling updateNodeStatus. It will return immediately which means any remaining node won't have its update status put back to true. Looking at the call sites for UpdateNodeStatuses, it appears this is not the only issue. If the lister call fails with anything but a Not Found error, it's silently ignored which is wrong in the detach path. Also the reconciler detach path calls UpdateNodeStatuses but the real intent is to only update the node currently processed in the loop and not proceed with the detach call if there is an error updating that specific node's volumesAttached property. With the current implementation, it will not proceed if there is an error updating another node (which is not completely bad but not ideal) and worse it will proceed if there is a lister error on that node which means the node volumesAttached property won't have been updated. To fix those issues, introduce the following changes: - [node_status_updater] introduce UpdateNodeStatusForNode which does what UpdateNodeStatuses does but only for the provided node - [node_status_updater] if the node lister call fails for anything but a Not Found error, we will return an error, not ignore it - [node_status_updater] if the update of a node volumesAttached properties fails we continue processing the other nodes - [actual_state_of_world] introduce GetVolumesToReportAttachedForNode which does what GetVolumesToReportAttached does but for the node whose name is provided; it returns a bool which indicates if the node in question needs an update as well as the volumesAttached list. It is used by UpdateNodeStatusForNode - [actual_state_of_world] use write lock in updateNodeStatusUpdateNeeded, we're modifying the map content - [reconciler] use UpdateNodeStatusForNode in the detach loop	2022-03-02 10:47:28 -08:00
d-honeybadger	621894de9d	cronjob_controllerv2: do not filter jobs to be reconciled by labels	2022-02-23 16:19:13 +00:00
Hemant Kumar	f61c4b18c4	use node informer to check volumes attachment status before backoff fix unit tests	2022-01-06 12:15:56 -05:00
Rob Scott	271a9f0e58	Improving performance of EndpointSlice controller metrics cache	2021-12-21 08:45:31 -08:00
Aohan Yang	98cc4f9e96	fix the error when cleaning up jobs for cronjob	2021-12-20 08:45:10 -05:00
Matthew Cary	0e2b901762	Clean up deep copy needed for UpdateStatefulSet Change-Id: Id732358183d682d1a945cfee56f83bcaac0d7c31	2021-11-23 06:48:54 -08:00
Kubernetes Prow Robot	084b28f6d5	Merge pull request #106510 from robscott/topology-ready-fix-controller Updating TopologyCache to disregard unready endpoints in calculations	2021-11-19 17:07:11 -08:00
Rob Scott	9813ec7e8a	Updating TopologyCache to disregard unready endpoints in calculations	2021-11-18 13:54:09 -08:00
Matthew Cary	53b3a6c1d9	controller change for statefulset auto-delete (tests) Change-Id: I16b50e6853bba65fc89c793d2b9b335581c02407	2021-11-17 16:48:50 -08:00
Matthew Cary	bce87a3e4f	controller change for statefulset auto-delete (implementation)	2021-11-17 16:48:50 -08:00
Kubernetes Prow Robot	39c76ba2ed	Merge pull request #106455 from soltysh/cronjob_warning Add warning about using unsupported CRON_TZ	2021-11-16 17:44:31 -08:00
Kubernetes Prow Robot	165b581759	Merge pull request #105623 from ash2k/ash2k/resettable-rest-mapper ResettableRESTMapper to make it possible to reset wrapped mappers	2021-11-16 16:36:08 -08:00
Kubernetes Prow Robot	6805e6ee41	Merge pull request #104722 from leiyiz/migration turning on the CSIMigrationGCE feature flag	2021-11-16 15:28:32 -08:00
Kubernetes Prow Robot	f151a40d8d	Merge pull request #106154 from gnufied/recover-expansion-failure-123 Recover expansion failure	2021-11-16 13:21:34 -08:00
Léiyì Zhang	275fdf0884	fixing unit test failures induced by turning on CSIMigrationGCE disable CSIMigrationGCE in some unit tests	2021-11-16 19:26:30 +00:00
Maciej Szulik	d0518848b5	Add warning about using unsupported CRON_TZ CRON_TZ variable slipped in during upgrading github.com/robfig/cron library. It allows setting a time zone which is a long requested feature but one that is not officially supported. This adds warning event since users should not rely on unsupported features.	2021-11-16 17:41:37 +01:00
Hemant Kumar	1ddd598d31	Implement controller and kubelet changes for recovery from resize failures	2021-11-16 11:06:46 -05:00
Kubernetes Prow Robot	6d1d8c73ee	Merge pull request #106316 from josephburnett/controller-v2 Watch HPA v2 instead of v1.	2021-11-16 06:41:38 -08:00
Joseph Burnett	711f96e05e	Watch HPA v2 instead of v1.	2021-11-16 11:13:21 +01:00
Kubernetes Prow Robot	ce98eda406	Merge pull request #106376 from jsafrane/stabilize-unit-test Fix deletion protection unit test	2021-11-15 13:04:48 -08:00
Neha Lohia	fa1b6765d5	move pkg/util/node to component-helpers/node/util (#105347 ) Signed-off-by: Neha Lohia <nehapithadiya444@gmail.com>	2021-11-12 07:52:27 -08:00
Jan Safranek	bb8157d780	Fix deletion protection unit test The test should not depend on current set of default feature gates, it should always ensure the ones necessary for the tests are set.	2021-11-12 10:47:15 +01:00

5591 commits