This change starts `attempt` at `1`, simplifying its use as a number for
reading in log messages and `if`s. Also, with `attempt` previously
starting at `0`, the second attempt would have been made immediately, as
our backoff implementation returns `0` in that case.
Co-Authored-By: Alvar Penning <alvar.penning@icinga.com>
Logging the `attempt` is a meaningless metric: it is not logged
continuously, but only when the retryable error changes, and it lacks
context since there is no such thing as a maximum number of attempts.
The retryable function may exit prematurely due to context errors that
shouldn't be retried. Before, we checked the returned error for context
errors, i.e. used `errors.Is()` to compare it to `Canceled` and
`DeadlineExceeded` which also yields `true` for errors that implement
`Is()` accordingly. For example, this applies to some non-exported Go
`net` errors. Now we explicitly check the context error instead.
All of our error callbacks are used to log the error and indicate that
we are retrying. Previously, in the case of context errors or
non-retryable errors, we would have called these too, which would have
resulted in misleading log messages.
Ensure that updating/inserting the instance row is also completed by the
current heartbeat's expiry time in takeover scenarios. Otherwise, there
is a risk that the takeover is performed with an already expired
heartbeat if the attempt takes longer than the heartbeat's expiry time.
Before, with `Timeout > 0`, we had to check errors for `nil` because we
were using a deadline context that might have been canceled before the
retry function completed with an error, but since we now pass `ctx`
as-is, this is no longer necessary.
Before, the error may have been handled in the `<-ctx.Done()` case in
the `select` for the backoff sleep, since `DeadlineExceeded` errors
return `true` for `Temporary()` and `Timeout()`.
This reverts commit a34aef4fc5.
Before, with `Timeout > 0`, we had to return `context canceled` and
`context deadline exceeded` errors from the causing context, but since
we now pass `ctx` as-is, this change is no longer necessary.
Before, with `Timeout > 0`, we ensured that retrying stops after
`Timeout` expires by passing a deadline context to `RetryableFunc`,
which aborts the function once `Timeout` lapsed, assuming that
`context.Done()` is actually taken into account, which applies to all of
our usages. I'm pretty sure we didn't think about functions that run
longer than `Timeout` and could therefore be canceled prematurely. Since
we now retry every database error with a timeout of 5 minutes, this
could happen with queries that wait for locks if a generous lock wait
timeout is configured in the database server. Now, `RetryableFunc` is
granted full execution time and is not canceled when `Timeout` is
exceeded. This means that `WithBackoff` may not stop exactly after
`Timeout` expires, or may not retry at all if the first execution of
`RetryableFunc` already takes longer than `Timeout`.
So far, we have maintained a list of error codes that should be retried.
This has by no means included all errors that can be retried, and errors
that can occur in a database cluster have not even been considered.
Instead of going through all possible error codes and verifying [1]
whether they should be included in the list of retryable errors,
**every** database error is now simply retried. Of course, this also
means retrying errors that cannot succeed on retry at all, but since we
now give up after 5 minutes, that's fine.
[1] It's hard to tell from a brief vendor error description whether the
error is actually retryable without context of when and how exactly such
errors are triggered. Also, there are database clusters that send their
own errors using vendor error codes.
A float isn't necessary, as Icinga 2's Checkable#max_check_attempts and
check_attempt are ints. But uint8 isn't enough for e.g. 1 check/s to get
HARD after 5m (300s > 255).
The reason for a switch in the HA roles was not always directly clear.
This change now introduces additional debug logging, indicating the
reasoning for either taking over or handing over the HA responsibility.
First, some logic was moved from the SQL query selecting active Icinga
DB instances to Go code. This allowed distinguishing between no
available responsible instances and responsible instances with an
expired heartbeat.
As the HA's peer timeout is logically bound to the Redis timeout, it now
references that timeout plus an additional grace period. Doing so
eliminates a race between a handover and a "forceful" takeover.
While the old code based a takeover on the fact that no other instance
is active, it now additionally checks whether this node is already the
active/responsible one. In that case, the takeover logic, which would be
interrupted at a later point anyway since the node is already
responsible, can be skipped.
In addition to the new logging messages, both the takeover and handover
channels now transport a string communicating the reason instead of an
empty struct{}. This enriches both the "Taking over" and "Handing over"
log messages with a reason.
This also required a change in the suppressed logging handling of the
HA.realize method, which got its logging enabled through the shouldLog
parameter. Now, there are both recurring events, which might be
suppressed, as well as state changing events, which should be logged.
Therefore, and because the logTicker's functionality was not clear to me
at first glance, I renamed it to routineLogTicker.
While dealing with the code, some function signature documentation was
added to ease understanding, both for me and for future readers.
Additionally, the error handling of the SQL query selecting active
Icinga DB instances was changed slightly to also handle wrapped
sql.ErrNoRows errors.
Closes #688.
In Galera clusters, wsrep_sync_wait=7 makes statements first wait for
all pending synchronization between nodes. This way, new child rows
await fresh parent rows from other nodes so as not to run into foreign
key errors. Single-node MySQL rejects this with error 1193 "Unknown
system variable", which is OK.
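A sketch of tolerating that rejection; `mysqlError` and `setGaleraSyncWait` are illustrative stand-ins (the real driver type is go-sql-driver/mysql's `MySQLError`):

```go
package main

import (
	"errors"
	"fmt"
)

// mysqlError stands in for the driver's error type carrying the vendor
// error number.
type mysqlError struct {
	Number uint16
	Msg    string
}

func (e *mysqlError) Error() string { return fmt.Sprintf("Error %d: %s", e.Number, e.Msg) }

// setGaleraSyncWait issues SET SESSION wsrep_sync_wait=7 via exec and
// treats error 1193 ("Unknown system variable") as success, since plain
// single-node MySQL doesn't know Galera variables.
func setGaleraSyncWait(exec func(query string) error) error {
	err := exec("SET SESSION wsrep_sync_wait=7")
	var me *mysqlError
	if errors.As(err, &me) && me.Number == 1193 {
		return nil // not a Galera node, that's OK
	}
	return err
}

func main() {
	err := setGaleraSyncWait(func(string) error {
		return &mysqlError{Number: 1193, Msg: "Unknown system variable"}
	})
	fmt.Println(err) // <nil>: 1193 is tolerated
}
```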
Looks like newer Go versions have a different opinion on how indentation
in comments should look. Adapt existing comments to make the GitHub
Actions happy.
This commit simplifies the `icingaDbOutputStage` type to contain only one
entity slice to be inserted/upserted. This allows simplifying the
handling in `migrateOneType()` by removing nested loops.
Additionally, a bit of code inside that function is outsourced into a new
`utils.ChanFromSlice()` function. This makes the body of the loop over the
insert/upsert operation (the loop using the `op` variable) simple enough
that it can just be unrolled, which saves the inline struct and slice
definition for that loop.
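A generic sketch of what such a helper can look like (the actual signature in `utils` may differ):

```go
package main

import "fmt"

// ChanFromSlice returns a channel that yields the slice's elements and
// is then closed, so ranging over it terminates once the slice is
// drained.
func ChanFromSlice[T any](s []T) <-chan T {
	ch := make(chan T, len(s))
	for _, v := range s {
		ch <- v
	}
	close(ch)
	return ch
}

func main() {
	for v := range ChanFromSlice([]string{"insert", "upsert"}) {
		fmt.Println(v)
	}
}
```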
The RDBMS rejects them by default. But it doesn't reject their equivalent:
Append "LOCK IN SHARE MODE" to every SELECT in a REPEATABLE READ transaction.
Now we do the latter with MySQL.
During connect(2) we may get ECONNREFUSED between the server's bind(2)
and listen(2), but for most of the downtime between boot and service
start, the socket won't exist yet. I.e., ENOENT is the de facto
ECONNREFUSED of *nix sockets.
so that connection attempts are also retried on RDBMS-specific errors,
e.g. Postgres' 57P03 (the database system is starting up), so as not to
crash.
On the other hand, SQL operations which are safe to retry on SQL errors
are also safe to retry on network errors.
Also, the order of the checks has been adjusted and the documentation
adapted accordingly. In addition, EAGAIN is no longer checked
explicitly, since this is already covered via Timeout().
I observed that blocking XREADs without timeouts (BLOCK 0) exceed the
go-redis internal retries on multiple consecutive Redis restarts and I/O
timeouts, eventually leading to fatal errors. @julianbrost looked at
this for clarification; here is his finding:
go-redis only considers a command successful when it returned something,
so a successfully started blocking XREAD consumes a retry attempt each
time the underlying Redis connection is terminated. If this happens
often before any element appears in the stream, this error is
propagated. (This also means that even with this PR, when restarting
Redis often enough so that a query never reaches the BLOCK 1sec, this
would still happen.)
https://github.com/Icinga/icingadb/pull/504#issuecomment-1164589244
Previously, we set the maximum number of retries to the pool size + 1,
but increased the pool size immediately after this assignment, so the
maximum number of retries was always too low for systems with fewer than
4 cores. Now it is done the other way around.
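The ordering bug, illustrated with made-up numbers (not the actual defaults):

```go
package main

import "fmt"

func main() {
	// Buggy order: MaxRetries derived from the pool size *before* it is
	// increased.
	poolSize := 2 // e.g. a machine with fewer than 4 cores
	maxRetries := poolSize + 1
	poolSize += 8
	fmt.Println(maxRetries, poolSize) // 3 10: too few retries for the pool

	// Fixed order: increase the pool size first, then derive MaxRetries.
	poolSize = 2
	poolSize += 8
	maxRetries = poolSize + 1
	fmt.Println(maxRetries, poolSize) // 11 10
}
```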
Rather than using a JSON structure to convey these values, simply use the
existing format to communicate performance data to Icinga 2.
Also removes the reference to Go in the Redis structure, allowing this string
to be extended with more metrics in the future without running into naming
issues.
instead of preserving the (never read) data and reading beyond its end later.
This indicates the correct number of pending runtime updates
(for monitoring by Icinga 2) from the beginning.