Commit graph

1972 commits

Author SHA1 Message Date
Eric Lippmann
81085c04a4 HA: Give up retrying after 5 minutes
Since we are now retrying every database error,
we also need to set a retry timeout.
2024-04-10 15:24:00 +02:00
Eric Lippmann
9393333581 HA.retry(): Always use context with a deadline
Ensure that updating/inserting of the instance row is also completed by
the current heartbeat's expiry time in takeover scenarios. Otherwise,
there is a risk that the takeover will be performed with an already
expired heartbeat if the attempt takes longer than the expiry time of
the heartbeat.
2024-04-09 16:02:51 +02:00
Eric Lippmann
ed6aab8503 retry: Reduce nil checks for errors
Before, with `Timeout > 0`, we had to check errors for `nil` because we
were using a deadline context that might have been cancelled before the
retry function completed with an error, but since we now pass `ctx`
as-is, this is no longer necessary.
2024-04-09 16:02:51 +02:00
Eric Lippmann
a468703876 retry: return immediately after context errors
Before, the error may have been handled in the `<-ctx.Done()` case in
the `select` for the backoff sleep, since `DeadlineExceeded` errors
return `true` for `Temporary()` and `Timeout()`.
2024-04-09 16:02:50 +02:00
Eric Lippmann
4112b269ec Revert "retry: if stopped due to outer context, return that error"
This reverts commit a34aef4fc5.

Before, with `Timeout > 0`, we had to return `context canceled` and
`context deadline exceeded` errors from the causing context, but since
we now pass `ctx` as-is, this change is no longer necessary.
2024-04-09 16:02:50 +02:00
Eric Lippmann
8ef8a065c3 retry: Don't cancel RetryableFunc if it exceeds Timeout
Before, if `Timeout > 0`, we made sure to stop retrying once `Timeout`
expired by passing a deadline context to `RetryableFunc`, which aborts
the function once `Timeout` has lapsed, assuming that `ctx.Done()` is
actually taken into account, which applies to all of our usages. I'm
pretty sure we didn't think about functions that run longer than
`Timeout` and could therefore be canceled prematurely. Since we are now
retrying every database error with a timeout of 5 minutes, this could
happen with queries that wait for locks when a generous lock wait
timeout is configured in the database server. Now, `RetryableFunc` is
granted full execution time and will not be canceled if `Timeout` is
exceeded. This means that `WithBackoff` may not stop exactly after
`Timeout` expires, or may not retry at all if the first execution of
`RetryableFunc` already takes longer than `Timeout`.
2024-04-09 16:02:50 +02:00
Eric Lippmann
e04087c218 Retention: Also retry DELETE statements 2024-04-09 16:02:50 +02:00
Eric Lippmann
33e07c4eaa Retry **every** database error
So far, we have maintained a list of error codes that should be retried.
This has by no means included all errors that can be retried, and errors
that can occur in a database cluster have not even been considered.
Instead of going through all possible error codes and verifying [1]
whether they should be included in the list of retryable errors,
**every** database error is now simply retried. Of course, this also
means that errors are retried that cannot be retried at all, but since
we now give up after 5 minutes, that's fine.

[1] It's hard to tell from a brief vendor error description whether the
error is actually retryable without context of when and how exactly such
errors are triggered. Also, there are database clusters that send their
own errors using vendor error codes.
2024-04-09 16:02:50 +02:00
Eric Lippmann
a3f8d6aec4 db: Log retried queries and give up after 5 minutes
Before, there was neither logging nor a timeout for retrying queries.
2024-04-09 16:02:50 +02:00
Julian Brost
fd4ffdac56
Merge pull request #733 from Icinga/deprecated-ioutil
Remove deprecated io/ioutil package references
2024-04-09 15:38:35 +02:00
Julian Brost
d91dc1ed53
Merge pull request #732 from Icinga/set-mysql-cfg-logger
Register mysql default logger prior to `mysql#NewConfig()`
2024-04-09 15:27:20 +02:00
Yonas Habteab
c7162e5de1 Register mysql default logger prior to 'mysql#NewConfig()' 2024-04-09 14:49:18 +02:00
Julian Brost
0daca8b15e
Merge pull request #656 from Icinga/max_check_attempts-range
Unify check attempt data type to uint32 already used somewhere
2024-04-09 14:11:09 +02:00
Julian Brost
21cef4eea7
Merge pull request #731 from Icinga/config-example-yml-yaml-database-options-i726
config.example.yml: Comment out unmodified blocks
2024-04-09 11:00:58 +02:00
Julian Brost
ac85b52951 Upgrading docs for state_history schema migration
Co-authored-by: Alexander A. Klimov <alexander.klimov@icinga.com>
2024-04-08 16:01:53 +02:00
Alexander A. Klimov
2681634105 Unify check attempt data type to uint32 already used somewhere
A float isn't necessary, as Checkable#max_check_attempts and
check_attempt are ints in Icinga 2. But uint8 isn't enough, e.g. for a
check running once per second to get HARD after 5m (300 attempts > 255).
2024-04-08 16:01:53 +02:00
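The arithmetic in 2681634105 is easy to verify in Go. The helper names below are hypothetical; the example only shows why 300 attempts do not fit into a `uint8` while a `uint32` holds them without loss.

```go
package main

import (
	"fmt"
	"math"
)

// fitsUint8 reports whether a check attempt counter fits into a uint8.
func fitsUint8(attempts uint32) bool {
	return attempts <= math.MaxUint8
}

// truncateToUint8 shows what a lossy conversion would store instead.
func truncateToUint8(attempts uint32) uint8 {
	return uint8(attempts)
}

func main() {
	var attempts uint32 = 300 // one check per second for five minutes

	fmt.Println(fitsUint8(attempts))       // false
	fmt.Println(truncateToUint8(attempts)) // 44, i.e. 300 - 256
}
```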
Alvar Penning
f39c1fb386 Remove deprecated io/ioutil package references
The io/ioutil package has been deprecated since Go 1.16. All its
functions were moved to either the io or os package.
2024-04-08 15:56:53 +02:00
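For reference, a short sketch of common io/ioutil calls and their replacements available since Go 1.16. Which of these functions Icinga DB actually used is not stated in f39c1fb386, so the selection here is an assumption, and `roundTrip` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
	"strings"
)

// roundTrip writes content to a temporary file and reads it back using
// only the io and os replacements for io/ioutil.
func roundTrip(content string) (string, error) {
	dir, err := os.MkdirTemp("", "ioutil-example") // was ioutil.TempDir
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "data.txt")
	if err := os.WriteFile(path, []byte(content), 0o600); err != nil { // was ioutil.WriteFile
		return "", err
	}

	fromFile, err := os.ReadFile(path) // was ioutil.ReadFile
	if err != nil {
		return "", err
	}

	fromReader, err := io.ReadAll(strings.NewReader(string(fromFile))) // was ioutil.ReadAll
	if err != nil {
		return "", err
	}

	return string(fromReader), nil
}

func main() {
	got, err := roundTrip("hello")
	fmt.Println(got, err)
}
```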
Alvar Penning
f692309e72 config.example.yml: Comment out unmodified blocks
As reported in #726, the default values for `database.options` are
overwritten by go-yaml. By commenting out the key of the
empty/unmodified YAML dictionary, this bug is mitigated. To make this
change consistent, the keys of all other unmodified dictionary blocks
have also been commented out.

Close #726.
2024-04-08 14:35:25 +02:00
Julian Brost
0a9f5f1ea9
Merge pull request #686 from Icinga/missing-history-index
Add a correct composite `INDEX` for the history table
2024-04-08 10:34:07 +02:00
Yonas Habteab
79d6f7e85f Add upgrading hints/warnings 2024-04-05 15:48:21 +02:00
Yonas Habteab
2a0da3dec1 Add a correct composite INDEX for the history table 2024-04-05 15:43:20 +02:00
Julian Brost
d9dc16d54a
Merge pull request #711 from Icinga/drop-custom-driver-registry
Try setting `wsrep_sync_wait` for mysql connections
2024-04-05 15:07:04 +02:00
Julian Brost
89f56ac209
Merge pull request #728 from Icinga/dependabot/go_modules/tests/golang.org/x/sync-0.7.0
build(deps): bump golang.org/x/sync from 0.6.0 to 0.7.0 in /tests
2024-04-05 11:28:46 +02:00
Julian Brost
701639c15f
Merge pull request #727 from Icinga/dependabot/go_modules/golang.org/x/sync-0.7.0
build(deps): bump golang.org/x/sync from 0.6.0 to 0.7.0
2024-04-05 11:28:25 +02:00
dependabot[bot]
58517323a2
build(deps): bump golang.org/x/sync from 0.6.0 to 0.7.0 in /tests
Bumps [golang.org/x/sync](https://github.com/golang/sync) from 0.6.0 to 0.7.0.
- [Commits](https://github.com/golang/sync/compare/v0.6.0...v0.7.0)

---
updated-dependencies:
- dependency-name: golang.org/x/sync
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-04-04 18:28:36 +00:00
dependabot[bot]
5ceacaa0f0
build(deps): bump golang.org/x/sync from 0.6.0 to 0.7.0
Bumps [golang.org/x/sync](https://github.com/golang/sync) from 0.6.0 to 0.7.0.
- [Commits](https://github.com/golang/sync/compare/v0.6.0...v0.7.0)

---
updated-dependencies:
- dependency-name: golang.org/x/sync
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-04-04 18:18:30 +00:00
Julian Brost
80abf2b7b7
Merge pull request #692 from Icinga/ha-logging-i688
Enhance HA "Taking over", "Handing over" logging
2024-04-04 10:33:36 +02:00
Julian Brost
2826af4f8d
Merge pull request #725 from Icinga/wrong-number-xdel-args
`history.Sync`: Don't operate on closed channel
2024-04-02 14:20:57 +02:00
Alvar Penning
779afd1da3 Enhance HA "Taking over", "Handing over" logging
The reason for a switch in the HA roles was not always directly clear.
This change now introduces additional debug logging, indicating the
reasoning for either taking over or handing over the HA responsibility.

First, some logic was moved from the SQL query selecting active Icinga
DB instances to Go code. This allowed distinguishing between no
available responsible instances and responsible instances with an
expired heartbeat.

As the HA's peer timeout is logically bound to the Redis timeout, it
will now reference this timeout with an additional grace timeout. Doing
so eliminates a race between a handover and a "forceful" takeover.

The old code decided on a takeover based solely on the fact that no
other instance is active. It now additionally checks whether this node
is already the active/responsible one. In this case, the takeover logic,
which would be interrupted at a later point anyway because the node is
already responsible, can be skipped.

Next to the additional logging messages, both the takeover and handover
channels now transport a string to communicate the reason instead of an
empty struct{}. By doing so, both the "Taking over" and "Handing over"
log messages are enriched with the reason.

This also required a change in the suppressed logging handling of the
HA.realize method, which got its logging enabled through the shouldLog
parameter. Now, there are both recurring events, which might be
suppressed, as well as state changing events, which should be logged.
Therefore, and because the logTicker's functionality was not clear to me
at first glance, I renamed it to routineLogTicker.

While dealing with the code, some function signature documentation was
added to ease my own understanding as well as that of future readers.

Additionally, the error handling of the SQL query selecting active
Icinga DB instances was changed slightly to also handle wrapped
sql.ErrNoRows errors.

Closes #688.
2024-04-02 13:23:11 +02:00
Yonas Habteab
ce56dffa8f history.Sync: Don't operate on closed channel 2024-03-28 14:52:27 +01:00
Yonas Habteab
a8075ea1d1 Validate wsrep_sync_wait database option 2024-03-28 13:25:23 +01:00
Yonas Habteab
735135ea7b Document wsrep_sync_wait database option 2024-03-28 13:24:57 +01:00
Yonas Habteab
9a252a0e9d Drop icingadb#Register() & make mysqlLogger exportable 2024-03-28 13:19:44 +01:00
Yonas Habteab
9713cdc65e Database: Drop registerDriverOnce variable 2024-03-28 13:19:44 +01:00
Yonas Habteab
eaf9744f16 Move pkg/driver to pkg/icingadb/driver.go 2024-03-28 13:19:44 +01:00
Yonas Habteab
cacbae19f3 driver: Move timeout from package level to a function scope
Conflicts with the `timeout` variable in `ha.go` file.
2024-03-28 13:19:44 +01:00
Alexander A. Klimov
4d0b58cfb4 MySQL driver: on connect try setting wsrep_sync_wait, swallow error 1193
In Galera clusters, wsrep_sync_wait=7 makes statements wait until the
node has caught up with all pending syncs from other nodes. This way, new
child rows wait for fresh parent rows from other nodes instead of running
into foreign key errors. Single-node MySQL servers reject this with error
1193 "Unknown system variable", which is OK.
2024-03-28 13:19:44 +01:00
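The "try, but swallow error 1193" logic from 4d0b58cfb4 might look roughly like the sketch below. To stay self-contained it does not import a MySQL driver: `serverError` is a hypothetical stand-in for a driver error that carries the server's numeric error code, and `setWsrepSyncWait` with its `exec` callback is likewise an assumption for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// serverError is a hypothetical stand-in for a driver error type that
// exposes the numeric error code sent by the database server.
type serverError struct {
	Number  uint16
	Message string
}

func (e *serverError) Error() string {
	return fmt.Sprintf("Error %d: %s", e.Number, e.Message)
}

// errUnknownSystemVariable is the error code named in the commit message.
const errUnknownSystemVariable = 1193

// setWsrepSyncWait runs the SET SESSION statement via exec and ignores
// error 1193, which non-Galera servers return for the unknown variable.
func setWsrepSyncWait(exec func(query string) error, value int) error {
	err := exec(fmt.Sprintf("SET SESSION wsrep_sync_wait=%d", value))

	var se *serverError
	if errors.As(err, &se) && se.Number == errUnknownSystemVariable {
		return nil
	}

	return err
}

func main() {
	// Simulates a single-node server that does not know the variable.
	singleNode := func(string) error {
		return &serverError{Number: 1193, Message: "Unknown system variable 'wsrep_sync_wait'"}
	}

	fmt.Println(setWsrepSyncWait(singleNode, 7))
}
```

Any other error, for example a failed authentication, is still returned to the caller.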
Yonas Habteab
5348f8127e Driver: Allow to post initialize database connections 2024-03-28 13:19:44 +01:00
Yonas Habteab
e600cf107c Drop superfluous custom driver registration 2024-03-28 13:19:44 +01:00
Alexander A. Klimov
0b94df86a6 Make value for SET SESSION wsrep_sync_wait configurable 2024-03-28 13:19:44 +01:00
Julian Brost
51e5434374
Merge pull request #723 from Icinga/doc-database-options
Document `database.options` properly
2024-03-28 11:15:52 +01:00
Yonas Habteab
6a14e557ca Document database.options properly 2024-03-27 14:41:47 +01:00
Julian Brost
1ab2e0ed1e
Merge pull request #718 from Icinga/add-missing-retention-count-doc
Document `retention.count` & `retention.interval` options
2024-03-27 10:54:28 +01:00
Julian Brost
6f7558d96f
Merge pull request #721 from Icinga/dependabot/go_modules/tests/github.com/go-sql-driver/mysql-1.8.1
build(deps): bump github.com/go-sql-driver/mysql from 1.8.0 to 1.8.1 in /tests
2024-03-27 09:40:56 +01:00
Julian Brost
7c93498aa4
Merge pull request #720 from Icinga/dependabot/go_modules/github.com/go-sql-driver/mysql-1.8.1
build(deps): bump github.com/go-sql-driver/mysql from 1.8.0 to 1.8.1
2024-03-27 09:40:27 +01:00
dependabot[bot]
8289f5e95b
build(deps): bump github.com/go-sql-driver/mysql in /tests
Bumps [github.com/go-sql-driver/mysql](https://github.com/go-sql-driver/mysql) from 1.8.0 to 1.8.1.
- [Release notes](https://github.com/go-sql-driver/mysql/releases)
- [Changelog](https://github.com/go-sql-driver/mysql/blob/v1.8.1/CHANGELOG.md)
- [Commits](https://github.com/go-sql-driver/mysql/compare/v1.8.0...v1.8.1)

---
updated-dependencies:
- dependency-name: github.com/go-sql-driver/mysql
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-03-26 18:34:28 +00:00
dependabot[bot]
99763d55dc
build(deps): bump github.com/go-sql-driver/mysql from 1.8.0 to 1.8.1
Bumps [github.com/go-sql-driver/mysql](https://github.com/go-sql-driver/mysql) from 1.8.0 to 1.8.1.
- [Release notes](https://github.com/go-sql-driver/mysql/releases)
- [Changelog](https://github.com/go-sql-driver/mysql/blob/v1.8.1/CHANGELOG.md)
- [Commits](https://github.com/go-sql-driver/mysql/compare/v1.8.0...v1.8.1)

---
updated-dependencies:
- dependency-name: github.com/go-sql-driver/mysql
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-03-26 18:01:21 +00:00
Yonas Habteab
71a8f2d962 Document retention.count & retention.interval options 2024-03-26 15:36:35 +01:00
Julian Brost
2c468302ae
Merge pull request #657 from Icinga/Flatten-no-exponent
Flatten(): render even large numbers as-is, not using scientific notation
2024-03-25 16:32:32 +01:00
Alexander A. Klimov
17b63b214d Flatten(): render even large numbers as-is, not using scientific notation
E.g. 2000000000000000000 (explicitly), not 2e+18 (as with fmt.Sprintf("%v")).
2024-03-25 14:52:54 +01:00