Commit graph

602 commits

Author SHA1 Message Date
Eric Lippmann
2b3a5d4229 retry: Set attempt's initial value to 1
This change simplifies the use of `attempt` as a number for reading in
log messages and `if`s. Also before, with `attempt` starting with `0`,
the second attempt would have been taken immediately, as our backoff
implementation returns `0` in this case.

Co-Authored-By: Alvar Penning <alvar.penning@icinga.com>
2024-04-10 15:25:15 +02:00
Eric Lippmann
c2b449d3a6 HA: Don't log retry count
Logging of the `attempt` is a meaningless metric as it is not constantly
logged but only when the retryable error changes, and it has no context
as there is no such thing as max attempts.
2024-04-10 15:25:15 +02:00
Eric Lippmann
51a6ef25b8 retry: Explicitly check context for error
The retryable function may exit prematurely due to context errors that
shouldn't be retried. Before, we checked the returned error for context
errors, i.e. used `errors.Is()` to compare it to `Canceled` and
`DeadlineExceeded` which also yields `true` for errors that implement
`Is()` accordingly. For example, this applies to some non-exported Go
`net` errors. Now we explicitly check the context error instead.
2024-04-10 15:25:15 +02:00
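
The difference between the two checks can be sketched as follows. This is only an illustration, not the code from the commit: `timeoutErr` and `compareChecks` are hypothetical names, and `timeoutErr` merely imitates an error whose `Is()` method matches `context.DeadlineExceeded`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// timeoutErr is a hypothetical error type. Its Is() method claims equality
// with context.DeadlineExceeded although no context is involved.
type timeoutErr struct{}

func (timeoutErr) Error() string { return "i/o timeout" }

func (timeoutErr) Is(target error) bool { return target == context.DeadlineExceeded }

// compareChecks returns the verdict of inspecting the returned error
// (byError) and of asking the context itself (byContext), for a context
// that is still alive while the function failed with timeoutErr.
func compareChecks() (byError, byContext bool) {
	ctx := context.Background()
	var err error = timeoutErr{}

	byError = errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded)
	byContext = ctx.Err() != nil

	return byError, byContext
}

func main() {
	byError, byContext := compareChecks()
	fmt.Println(byError, byContext) // true false
}
```

Inspecting the error reports a context error although the context is still valid, whereas `ctx.Err()` does not.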
Eric Lippmann
357bcdbb73 retry: Execute on error callbacks for retryable errors only
All of our error callbacks are used to log the error and indicate that
we are retrying. Previously, in the case of context errors or
non-retryable errors, we would have called these too, which would have
resulted in misleading log messages.
2024-04-10 15:25:15 +02:00
Eric Lippmann
3a664c1696 Consolidate default retry timeout and settings 2024-04-10 15:25:14 +02:00
Eric Lippmann
81085c04a4 HA: Give up retrying after 5 minutes
Since we are now retrying every database error,
we also need to set a retry timeout.
2024-04-10 15:24:00 +02:00
Eric Lippmann
9393333581 HA.retry(): Always use context with a deadline
Ensure that updating/inserting of the instance row is also completed by
the current heartbeat's expiry time in takeover scenarios. Otherwise,
there is a risk that the takeover will be performed with an already
expired heartbeat if the attempt takes longer than the expiry time of
the heartbeat.
2024-04-09 16:02:51 +02:00
Eric Lippmann
ed6aab8503 retry: Reduce nil checks for errors
Before, with `Timeout >0`, we had to check errors for `nil` because we
were using a deadline context that might have cancelled before the retry
function completed with an error, but since we now pass `ctx` as-is,
this is no longer necessary.
2024-04-09 16:02:51 +02:00
Eric Lippmann
a468703876 retry: return immediately after context errors
Before, the error may have been handled in the `<-ctx.Done()` case in
the `select` for the backoff sleep, since `DeadlineExceeded` errors
return `true` for `Temporary()` and `Timeout()`.
2024-04-09 16:02:50 +02:00
Eric Lippmann
4112b269ec Revert "retry: if stopped due to outer context, return that error"
This reverts commit a34aef4fc5.

Before, with `Timeout >0`, we had to return `context canceled` and
`context deadline exceeded` errors from the causing context, but since
we now pass `ctx` as-is, this change is no longer necessary.
2024-04-09 16:02:50 +02:00
Eric Lippmann
8ef8a065c3 retry: Don't cancel RetryableFunc if it exceeds Timeout
Before, if `Timeout >0`, we made sure to stop retrying after `Timeout`
expires by passing a deadline context to `RetryFunc`, which aborts the
function once `Timeout` lapsed - assuming that `context.Done()` is
actually taken into account, which applies to all of our usages. I'm
pretty sure we didn't think about functions that run longer than
`Timeout` and therefore could be canceled prematurely. Since we are now
retrying every database error with a timeout of 5 minutes, this could
happen with queries that wait for locks having a generous lock wait
timeout configured in the database server. Now, `RetryableFunc` is
granted full execution time and will not be canceled if `Timeout` is
exceeded. This means that `WithBackoff` may not stop exactly after
`Timeout` expires, or may not retry at all if the first execution of
`RetryableFunc` already takes longer than `Timeout`.
2024-04-09 16:02:50 +02:00
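
A minimal sketch of the behaviour described above, under stated assumptions: `retryWithTimeout` is a hypothetical, heavily simplified stand-in for `WithBackoff` with a fixed sleep instead of a real backoff, and `demoSlowFirstAttempt` exists only for illustration. The function receives `ctx` as-is, and the timeout is only consulted between attempts.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retryWithTimeout is a hypothetical, simplified retry loop. f is never
// canceled because timeout elapsed; timeout only decides whether another
// attempt is started.
func retryWithTimeout(
	ctx context.Context, timeout, sleep time.Duration, f func(context.Context) error,
) (attempts int, err error) {
	start := time.Now()

	for attempt := 1; ; attempt++ {
		err := f(ctx) // ctx is passed unchanged, without an extra deadline
		if err == nil {
			return attempt, nil
		}
		if ctxErr := ctx.Err(); ctxErr != nil {
			return attempt, ctxErr // the caller's context ended, do not retry
		}
		if timeout > 0 && time.Since(start) >= timeout {
			return attempt, err // give up, but only after f has returned
		}

		select {
		case <-ctx.Done():
			return attempt, ctx.Err()
		case <-time.After(sleep):
		}
	}
}

// demoSlowFirstAttempt runs a function that takes longer than the timeout.
func demoSlowFirstAttempt() (int, error) {
	return retryWithTimeout(context.Background(), 10*time.Millisecond, time.Millisecond,
		func(context.Context) error {
			time.Sleep(30 * time.Millisecond) // e.g. a query waiting for a lock
			return errors.New("lock wait timeout")
		})
}

func main() {
	attempts, err := demoSlowFirstAttempt()
	fmt.Println(attempts, err)
}
```

In this sketch the slow function runs to completion once and is not retried, because the timeout has already lapsed when it returns.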
Eric Lippmann
e04087c218 Retention: Also retry DELETE statements 2024-04-09 16:02:50 +02:00
Eric Lippmann
33e07c4eaa Retry **every** database error
So far, we have maintained a list of error codes that should be retried.
This has by no means included all errors that can be retried, and errors
that can occur in a database cluster have not even been considered.
Instead of going through all possible error codes and verifying [1]
whether they should be included in the list of retryable errors,
**every** database error is now simply retried. Of course, this also
means that errors are retried that cannot be retried at all, but since
we now give up after 5 minutes, that's fine.

[1] It's hard to tell from a brief vendor error description whether the
error is actually retryable without context of when and how exactly such
errors are triggered. Also, there are database clusters that send their
own errors using vendor error codes.
2024-04-09 16:02:50 +02:00
Eric Lippmann
a3f8d6aec4 db: Log retried queries and give up after 5 minutes
Before, there was neither logging nor a timeout for retrying queries.
2024-04-09 16:02:50 +02:00
Julian Brost
fd4ffdac56
Merge pull request #733 from Icinga/deprecated-ioutil
Remove deprecated io/ioutil package references
2024-04-09 15:38:35 +02:00
Yonas Habteab
c7162e5de1 Register mysql default logger prior to 'mysql#NewConfig()' 2024-04-09 14:49:18 +02:00
Alexander A. Klimov
2681634105 Unify check attempt data type to uint32 already used somewhere
A float isn't necessary as in Icinga 2 Checkable#max_check_attempts and
check_attempt are ints. But uint8 isn't enough for e.g. 1 check/s to get
HARD after 5m (300s > 255).
2024-04-08 16:01:53 +02:00
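
The arithmetic behind the `uint8` remark can be checked with a small sketch. `countAttempts` is a hypothetical helper, not code from the repository.

```go
package main

import "fmt"

// countAttempts increments a uint8 and a uint32 counter once per check.
func countAttempts(checks int) (uint8, uint32) {
	var narrow uint8
	var wide uint32

	for i := 0; i < checks; i++ {
		narrow++ // wraps around to 0 after 255
		wide++
	}

	return narrow, wide
}

func main() {
	// 1 check/s for 5 minutes = 300 checks.
	narrow, wide := countAttempts(300)
	fmt.Println(narrow, wide) // 44 300
}
```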
Alvar Penning
f39c1fb386 Remove deprecated io/ioutil package references
The io/ioutil package is deprecated since Go 1.16. All its functions
were moved to either the io or os package.
2024-04-08 15:56:53 +02:00
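
For illustration, a few of the common replacements look like this. The helper `roundTrip` is hypothetical and not taken from the commit; the comments name the `io/ioutil` function each call replaces.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
	"strings"
)

// roundTrip writes content to a temporary file and reads it back using
// only the io and os packages.
func roundTrip(content string) (string, error) {
	dir, err := os.MkdirTemp("", "example") // formerly ioutil.TempDir
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "data.txt")
	if err := os.WriteFile(path, []byte(content), 0o600); err != nil { // formerly ioutil.WriteFile
		return "", err
	}

	fromFile, err := os.ReadFile(path) // formerly ioutil.ReadFile
	if err != nil {
		return "", err
	}

	all, err := io.ReadAll(strings.NewReader(string(fromFile))) // formerly ioutil.ReadAll
	return string(all), err
}

func main() {
	s, err := roundTrip("hello")
	fmt.Println(s, err)
}
```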
Julian Brost
d9dc16d54a
Merge pull request #711 from Icinga/drop-custom-driver-registry
Try setting `wsrep_sync_wait` for mysql connections
2024-04-05 15:07:04 +02:00
Julian Brost
80abf2b7b7
Merge pull request #692 from Icinga/ha-logging-i688
Enhance HA "Taking over", "Handing over" logging
2024-04-04 10:33:36 +02:00
Alvar Penning
779afd1da3 Enhance HA "Taking over", "Handing over" logging
The reason for a switch in the HA roles was not always directly clear.
This change now introduces additional debug logging, indicating the
reasoning for either taking over or handing over the HA responsibility.

First, some logic was moved from the SQL query selecting active Icinga
DB instances to Go code. This allowed distinguishing between no
available responsible instances and responsible instances with an
expired heartbeat.

As the HA's peer timeout is logically bound to the Redis timeout, it
will now reference this timeout with an additional grace timeout. Doing
so eliminates a race between a handing over and a "forceful" take over.

As the old code signaled a takeover based on the fact that no other instance
is active, it will now additionally check whether it already is the
active/responsible node. In this case, the takeover logic - which will
be interrupted at a later point as the node is already responsible - can
be skipped.

Next to the additional logging messages, both the takeover and handover
channel are now transporting a string to communicate the reason instead
of an empty struct{}. By doing so, both the "Taking over" and "Handing
over" log messages are enriched with the reason.

This also required a change in the suppressed logging handling of the
HA.realize method, which got its logging enabled through the shouldLog
parameter. Now, there are both recurring events, which might be
suppressed, as well as state changing events, which should be logged.
Therefore, and because the logTicker's functionality was not clear to me
on first glance, I renamed it to routineLogTicker.

While dealing with the code, some function documentation was added to
ease both my own understanding and that of future readers.

Additionally, the error handling of the SQL query selecting active
Icinga DB instances was changed slightly to also handle wrapped
sql.ErrNoRows errors.

Closes #688.
2024-04-02 13:23:11 +02:00
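
The point about wrapped `sql.ErrNoRows` errors can be sketched as follows. The wrapping message and the helper `demoWrapped` are made up for illustration; only `errors.Is` and `sql.ErrNoRows` are real APIs here.

```go
package main

import (
	"database/sql"
	"errors"
	"fmt"
)

// demoWrapped wraps sql.ErrNoRows and reports whether a direct comparison
// and errors.Is recognize it.
func demoWrapped() (direct, viaIs bool) {
	err := fmt.Errorf("can't select active instances: %w", sql.ErrNoRows)

	direct = err == sql.ErrNoRows         // misses the wrapped error
	viaIs = errors.Is(err, sql.ErrNoRows) // walks the wrap chain

	return direct, viaIs
}

func main() {
	direct, viaIs := demoWrapped()
	fmt.Println(direct, viaIs) // false true
}
```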
Yonas Habteab
ce56dffa8f history.Sync: Don't operate on closed channel 2024-03-28 14:52:27 +01:00
Yonas Habteab
a8075ea1d1 Validate wsrep_sync_wait database option 2024-03-28 13:25:23 +01:00
Yonas Habteab
735135ea7b Document wsrep_sync_wait database option 2024-03-28 13:24:57 +01:00
Yonas Habteab
9a252a0e9d Drop icingadb#Register() & make mysqlLogger exportable 2024-03-28 13:19:44 +01:00
Yonas Habteab
9713cdc65e Database: Drop registerDriverOnce variable 2024-03-28 13:19:44 +01:00
Yonas Habteab
eaf9744f16 Move pkg/driver to pkg/icingadb/driver.go 2024-03-28 13:19:44 +01:00
Yonas Habteab
cacbae19f3 driver: Move timeout from package level to a function scope
Conflicts with the `timeout` variable in `ha.go` file.
2024-03-28 13:19:44 +01:00
Alexander A. Klimov
4d0b58cfb4 MySQL driver: on connect try setting wsrep_sync_wait, swallow error 1193
In Galera clusters wsrep_sync_wait=7 lets statements catch up all
pending sync between nodes first. This way new child rows await fresh parent
ones from other nodes not to run into foreign key errors. MySQL single nodes
will reject this with error 1193 "Unknown system variable" which is OK.
2024-03-28 13:19:44 +01:00
Yonas Habteab
5348f8127e Driver: Allow to post initialize database connections 2024-03-28 13:19:44 +01:00
Yonas Habteab
e600cf107c Drop superfluous custom driver registration 2024-03-28 13:19:44 +01:00
Alexander A. Klimov
0b94df86a6 Make value for SET SESSION wsrep_sync_wait configurable 2024-03-28 13:19:44 +01:00
Julian Brost
2c468302ae
Merge pull request #657 from Icinga/Flatten-no-exponent
Flatten(): render even large numbers as-is, not using scientific notation
2024-03-25 16:32:32 +01:00
Alexander A. Klimov
17b63b214d Flatten(): render even large numbers as-is, not using scientific notation
E.g. 2000000000000000000 (explicitly), not 2e+18 (as with fmt.Sprintf("%v")).
2024-03-25 14:52:54 +01:00
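
The difference can be reproduced with the standard library. This is a sketch, not the actual `Flatten()` code: `render` is a hypothetical helper, and using `strconv.FormatFloat` with the `'f'` format is one way to avoid the exponent, assumed here for illustration.

```go
package main

import (
	"fmt"
	"strconv"
)

// render formats f once with fmt's default verb and once without exponent.
func render(f float64) (withV, plain string) {
	withV = fmt.Sprintf("%v", f)
	plain = strconv.FormatFloat(f, 'f', -1, 64)

	return withV, plain
}

func main() {
	withV, plain := render(2000000000000000000)
	fmt.Println(withV) // 2e+18
	fmt.Println(plain) // 2000000000000000000
}
```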
Alexander A. Klimov
365f97d092 Flatten(): type-check input only once 2024-03-25 14:52:54 +01:00
Alexander A. Klimov
10afc562ce Use types.MakeString() instead of manual initialization (refactor) 2024-03-25 14:52:54 +01:00
Alexander A. Klimov
d36ade1f14 Test Flatten() 2024-03-25 14:52:54 +01:00
Alexander A. Klimov
e2fc7695e0 Introduce types.MakeString() 2024-03-25 14:52:54 +01:00
Julian Brost
194028a35a
Merge pull request #715 from Icinga/go-redis-v9
Upgrade `go-redis` to `v9`
2024-03-25 11:02:02 +01:00
Yonas Habteab
a27a743c27
Merge pull request #684 from Icinga/Al2Klimov-patch-2
Remove redundant closure
2024-03-25 09:26:38 +01:00
Eric Lippmann
e31b101f4f Upgrade go-redis to v9
Co-Authored-By: Alvar Penning <alvar.penning@icinga.com>
2024-03-22 15:32:15 +01:00
Julian Brost
112f6d7966
Merge pull request #699 from Icinga/mysql-strict-mode
MySQL/MariaDB: Use strict SQL mode
2024-03-21 15:49:47 +01:00
Eric Lippmann
16d43cb10a MySQL/MariaDB: Use strict SQL mode
For MySQL (and MariaDB, etc.), in addition to `ANSI_QUOTES` SQL mode,
we now also set `TRADITIONAL`, which enables strict mode.
2024-03-19 09:53:20 +01:00
Eric Lippmann
2586c62251 Retry broken pipe errors (EPIPE) 2024-03-14 09:51:34 +01:00
Alexander Aleksandrovič Klimov
29d1a6bafc
Remove redundant closure
which just wraps a method with equal signature.
2024-03-08 16:41:31 +01:00
Julian Brost
653f356123 Increase database schema version
The removal of the `NOT NULL` constraint on `customvar_flat`.`flat_value` makes
the schema upgrade a hard requirement for 1.1.1.
2023-08-07 13:09:01 +02:00
Julian Brost
9c2dcd2502 Run go fmt ./...
Looks like newer Go versions have a different opinion on how indentation in
comments should look. Adapt existing comments to make the GitHub Actions
happy.
2023-08-04 12:50:45 +02:00
Julian Brost
336ee4a8ab
Merge pull request #554 from Icinga/553
convertFlappingRows(): fix foreign key error history -> flapping_history
2023-07-31 15:00:39 +02:00
Julian Brost
71c1d2fa4d Migration: refactor output/processing of converted entities
This commit simplifies the `icingaDbOutputStage` type to contain only one
entity slice to be inserted/upserted. This allows simplifying the handling in
`migrateOneType()` by removing nested loops.

Additionally, a bit of code inside that function is outsourced into a new
`utils.ChanFromSlice()` function. This makes the body of the loop over the
insert/upsert operation (the loop using the `op` variable) simple enough so
that it can just be unrolled which saves the inline struct and slice definition
for that loop.
2023-07-31 11:10:42 +02:00
Julian Brost
62f7ae9114
Merge pull request #609 from Icinga/percona-xtradb-cluster
Support Percona XtraDB Cluster by not using SERIALIZABLE transactions directly
2023-07-31 10:24:10 +02:00
Alexander Aleksandrovič Klimov
99de1079f8
Merge pull request #593 from Icinga/wait-for-database-to-start-rather-than-crashing-561
Merge network and database error retryability detection functions
2023-07-27 17:58:06 +02:00
Julian Brost
68d26a6873
Merge pull request #601 from Icinga/flatten-empty-custom-vars-correctly
Write a hint for empty arrays/dicts into `customvar_flat`
2023-07-25 15:28:07 +02:00
Julian Brost
536c808bca
Merge pull request #605 from Icinga/yaml.DisallowUnknownField
Config parsing: fail on unknown fields and print them
2023-07-25 10:29:29 +02:00
Julian Brost
ef09059549
Merge pull request #612 from Icinga/bool-binary-unixmilli-marshaljson-return-valid-json-not-empty-string
{Bool,Binary,UnixMilli}#MarshalJSON(): return valid JSON, not empty string
2023-07-25 10:28:44 +02:00
Alexander A. Klimov
af868b1762 Config parsing: unit test failure due to unknown fields 2023-07-11 09:57:54 +02:00
Alexander A. Klimov
b8ed25c87a Test UnixMilli#MarshalJSON() 2023-07-07 16:47:45 +02:00
Alexander A. Klimov
7568c47378 Test Bool#MarshalJSON() 2023-07-07 16:47:45 +02:00
Alexander A. Klimov
0745ba7d9e Test Binary#MarshalJSON() 2023-07-07 16:47:45 +02:00
Alexander A. Klimov
0291c860a1 {Bool,Binary,UnixMilli}#MarshalJSON(): return valid JSON, not empty string
in case an instance is null.
2023-07-04 15:38:17 +02:00
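
Why an empty result is a problem can be shown with two simplified types. Both `nullableBool` and `brokenBool` are hypothetical stand-ins, not the real `Bool` type: `encoding/json` rejects an empty byte slice returned by `MarshalJSON()`, while the literal `null` is valid JSON.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// nullableBool is a hypothetical, simplified nullable boolean.
type nullableBool struct {
	Bool  bool
	Valid bool // false means null
}

// MarshalJSON returns the JSON literal null for null instances.
func (b nullableBool) MarshalJSON() ([]byte, error) {
	if !b.Valid {
		return []byte("null"), nil
	}
	return json.Marshal(b.Bool)
}

// brokenBool mimics the old behaviour of returning an empty string.
type brokenBool struct{}

func (brokenBool) MarshalJSON() ([]byte, error) { return []byte(""), nil }

// encodeNull marshals a null value of both types.
func encodeNull() (fixed string, brokenErr error) {
	out, err := json.Marshal(map[string]nullableBool{"v": {}})
	if err != nil {
		return "", err
	}

	_, brokenErr = json.Marshal(map[string]brokenBool{"v": {}})

	return string(out), brokenErr
}

func main() {
	fixed, brokenErr := encodeNull()
	fmt.Println(fixed) // {"v":null}
	fmt.Println(brokenErr != nil)
}
```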
Yonas Habteab
fa0a712bac Flatten empty custom vars of type array & map correctly 2023-06-29 13:43:33 +02:00
Alexander A. Klimov
6dc4998802 Support Percona XtraDB Cluster by not using SERIALIZABLE transactions directly
The RDBMS rejects them by default. But it doesn't reject their equivalent:
Append "LOCK IN SHARE MODE" to every SELECT in a REPEATABLE READ transaction.
Now we do the latter with MySQL.
2023-06-22 15:13:40 +02:00
Alexander A. Klimov
a163694dc6 Config parsing: fail on unknown fields and print them
Useful against config validation or runtime failures
caused by wrong field spelling or YAML indentation.
2023-06-19 17:38:52 +02:00
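
The idea can be illustrated with the standard library. Note the swap: this sketch uses `encoding/json` and its `Decoder.DisallowUnknownFields()` instead of the YAML parser the project uses, purely to stay self-contained. The `config` struct and `parseStrict` are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// config is a hypothetical configuration struct.
type config struct {
	Host string `json:"host"`
	Port int    `json:"port"`
}

// parseStrict decodes raw and fails if it contains fields unknown to config.
func parseStrict(raw string) (config, error) {
	var c config

	dec := json.NewDecoder(strings.NewReader(raw))
	dec.DisallowUnknownFields()
	err := dec.Decode(&c)

	return c, err
}

func main() {
	// "prot" is a misspelling of "port" and is reported instead of ignored.
	_, err := parseStrict(`{"host": "localhost", "prot": 5432}`)
	fmt.Println(err)
}
```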
Julian Brost
78fa223cab
Merge pull request #559 from Icinga/segv
Heartbeat#sendEvent(m): nil-check m before dereferencing it
2023-06-05 12:54:29 +02:00
Alexander A. Klimov
a3c1007d47 retry.Retryable(): treat ENOENT (AF_UNIX) like ECONNREFUSED, i.e. also retry
During connect(2) we may get ECONNREFUSED between server's bind(2) and
listen(2), but for most of the downtime between boot and service start, the
socket won't exist yet. I.e. ENOENT is the de facto ECONNREFUSED of *nix sockets.
2023-06-05 11:21:45 +02:00
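
A check along these lines might look as follows. `retryableConnectError` and `demoMissingSocket` are hypothetical names, and the error is constructed by hand to imitate a failed dial to a missing UNIX socket; the actual `retry.Retryable()` covers more cases.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"syscall"
)

// retryableConnectError reports whether err indicates a refused connection
// or a UNIX socket file that does not exist (yet).
func retryableConnectError(err error) bool {
	return errors.Is(err, syscall.ECONNREFUSED) || errors.Is(err, syscall.ENOENT)
}

// demoMissingSocket builds an error shaped like a failed dial to a missing
// UNIX socket and runs the check on it.
func demoMissingSocket() bool {
	var err error = &net.OpError{
		Op:  "dial",
		Net: "unix",
		Err: os.NewSyscallError("connect", syscall.ENOENT),
	}

	return retryableConnectError(err)
}

func main() {
	fmt.Println(demoMissingSocket()) // true
}
```

`errors.Is` unwraps `*net.OpError` and `*os.SyscallError`, so the check also works on errors returned by a real dial.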
Alexander A. Klimov
e776c99ede Merge network and database error retryability detection functions
so that connection attempts will also be re-tried on RDBMS-specific errors,
e.g. Postgres' 57P03 (the database system is starting up), not to crash.
On the other hand, SQL operations which are safe to retry on SQL errors
are also safe to retry on network errors.
2023-05-26 12:21:09 +02:00
Alexander A. Klimov
5a79a72ff5 Heartbeat#sendEvent(m): nil-check m before dereferencing it
as it can be nil.
2023-01-19 16:55:11 +01:00
Alexander A. Klimov
ab14413393 Log UNIX socket address w/o port number
E.g. not
Connecting to database at '/var/lib/mysql/mysql.sock:0'
Connecting to Redis at '/run/icingadb-redis/icingadb-redis-server.sock:6380'
2022-11-09 11:03:50 +01:00
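
One way to produce such log output is sketched below. `displayAddr` is a hypothetical helper, and treating a host that starts with `/` as a UNIX socket path is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
	"strings"
)

// displayAddr returns host as-is for UNIX socket paths and host:port
// otherwise.
func displayAddr(host string, port int) string {
	if strings.HasPrefix(host, "/") {
		return host // a socket path has no port
	}
	return net.JoinHostPort(host, strconv.Itoa(port))
}

func main() {
	fmt.Println(displayAddr("/var/lib/mysql/mysql.sock", 0)) // /var/lib/mysql/mysql.sock
	fmt.Println(displayAddr("localhost", 6380))              // localhost:6380
}
```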
Julian Brost
a327ef0275
Merge pull request #525 from Icinga/save-memory
Save memory during config sync via SyncSubject#FactoryForDelta()
2022-11-02 12:27:06 +01:00
Alexander A. Klimov
f7d132ccfa Make checkDbSchema() reusable as DB#CheckSchema() 2022-10-11 16:32:22 +02:00
Alexander A. Klimov
adcd004231 Introduce DB#CreateIgnoreStreamed() 2022-10-11 12:46:48 +02:00
Alexander A. Klimov
f063687b2b DB#BuildInsertIgnoreStmt(): handle primary key being not "id" 2022-10-11 12:46:47 +02:00
Alexander A. Klimov
581270ffee cmd/ido2icingadb: support Postgres 2022-10-11 12:46:20 +02:00
Alexander A. Klimov
6209b5b376 Save memory during config sync via SyncSubject#FactoryForDelta()
Code comment TL;DR: Allocate the same amount of smaller data structures
2022-09-13 17:57:23 +02:00
Alexander Aleksandrovič Klimov
305800fdb0
Merge pull request #503 from Icinga/parseString-DRY
parseString(): Don't Repeat Yourself
2022-08-19 11:46:34 +02:00
Eric Lippmann
cf8e12f391 Set Redis server port to 6380 by default
All Icinga DB components use 6380 as default Redis port.
2022-06-29 15:42:04 +02:00
Julian Brost
5d25d81922
Merge pull request #508 from Icinga/state-soft_state
icinga:*:state: rename state to soft_state
2022-06-29 13:25:11 +02:00
Alexander A. Klimov
ae2c40e998 icinga:*:state: rename state to soft_state 2022-06-29 11:56:29 +02:00
Julian Brost
924d455348
Merge pull request #504 from Icinga/retry
Fixes related to the retry functionality
2022-06-29 10:41:31 +02:00
Eric Lippmann
df72c81708 Retry down and unreachable host or network errors 2022-06-29 09:59:12 +02:00
Eric Lippmann
9cb8bf36a6 Fix comment 2022-06-28 21:14:28 +02:00
Eric Lippmann
444332a682 Retry ECONNRESET
ECONNRESET is treated as a temporary error by Go only if it comes from
calling accept.
2022-06-28 19:58:02 +02:00
Julian Brost
e8f611ddc6
Merge pull request #505 from Icinga/make-json-keys-and-db-columns-consistent
Make json keys and DB columns consistent
2022-06-28 16:45:52 +02:00
Eric Lippmann
6a5db1ca94 Retry: Detect ECONNREFUSED properly
Also the order of the checks has been adjusted and the documentation has
been adapted to it. In addition, EAGAIN is no longer checked, since this
is already done via Timeout().
2022-06-28 16:09:32 +02:00
Eric Lippmann
cd96f0de6f Block XREADs for a maximum of one second
I just had the observation that blocking XREADs without timeouts (BLOCK
0) on multiple consecutive Redis restarts and I/O timeouts exceeds Redis
internal retries and eventually leads to fatal errors. @julianbrost
looked at this for clarification, here is his finding:

go-redis only considers a command successful when it returned something,
so a successfully started blocking XREAD consumes a retry attempt each
time the underlying Redis connection is terminated. If this happens
often before any element appears in the stream, this error is
propagated. (This also means that even with this PR, when restarting
Redis often enough so that a query never reaches the BLOCK 1sec, this
would still happen.)

https://github.com/Icinga/icingadb/pull/504#issuecomment-1164589244
2022-06-28 16:09:29 +02:00
Eric Lippmann
d9e876905f Fix Redis MaxRetries
Previously, we set the maximum number of retries to the pool size + 1,
but increased the pool size immediately after this assignment, so the
maximum number of retries was always too low for systems with less than
4 cores. Now it is set the other way around.
2022-06-28 16:09:04 +02:00
Eric Lippmann
5f29caecbe
Merge pull request #464 from Icinga/OwnHeartbeat
Write own status into Redis
2022-06-28 15:21:03 +02:00
Julian Brost
061660b023 Telemetry: use mutex for synchronizing last database error
The old CompareAndSwap based code tended to end up in an endless loop. Replace
it by simple synchronization mechanisms where this can't happen.
2022-06-28 13:30:00 +02:00
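
A mutex-based holder for the last error could be sketched like this. `lastError` and `demoConcurrentSet` are hypothetical and do not reproduce the telemetry code; the sketch only shows that plain locking needs no retry loop.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// lastError is a hypothetical holder for the most recent database error.
type lastError struct {
	mu  sync.Mutex
	err error
}

// Set stores err. It always finishes after acquiring the lock once.
func (l *lastError) Set(err error) {
	l.mu.Lock()
	defer l.mu.Unlock()

	l.err = err
}

// Get returns the stored error.
func (l *lastError) Get() error {
	l.mu.Lock()
	defer l.mu.Unlock()

	return l.err
}

// demoConcurrentSet stores the same error from several goroutines.
func demoConcurrentSet() string {
	var l lastError
	var wg sync.WaitGroup

	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			l.Set(errors.New("connection refused"))
		}()
	}
	wg.Wait()

	return l.Get().Error()
}

func main() {
	fmt.Println(demoConcurrentSet()) // connection refused
}
```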
Julian Brost
def7c5f22c Telemetry: change stats names in Redis
The same names are used in perfdata names and config_sync sounds more natural
than sync_config.
2022-06-28 13:30:00 +02:00
Julian Brost
741460c935 Telemetry: rename keys in heartbeat stream
In both C++ and Go, the keys are only used as constant strings, so namespacing
them just adds clutter for the `general:*` keys, therefore remove it.
2022-06-28 13:30:00 +02:00
Julian Brost
36d5f7b33c Telemetry: send Go metrics as performance data string
Rather than using a JSON structure to convey these values, simply use the
existing format to communicate performance data to Icinga 2.

Also removes the reference to Go in the Redis structure, allowing this string
to be extended with more metrics in the future without running into naming
issues.
2022-06-28 13:30:00 +02:00
Yonas Habteab
fa6c23d634 Make json keys and DB columns consistent 2022-06-24 16:00:11 +02:00
Alexander A. Klimov
e1ff704aff Write own heartbeat into icingadb:telemetry:heartbeat
including version, current DB error and HA status quo.
2022-06-23 18:31:45 +02:00
Alexander A. Klimov
80ab823294 Introduce Atomic[T] 2022-06-23 18:31:45 +02:00
Alexander A. Klimov
64d7f1be43 Remove unused StreamLastId() 2022-06-23 18:31:45 +02:00
Alexander A. Klimov
d85d070d1f Clear icinga:runtime* and read from 0-0 later
instead of preserving the (never read) data and reading beyond its end later.
This indicates the correct number of pending runtime updates
(for monitoring by Icinga 2) from the beginning.
2022-06-23 18:31:45 +02:00
Alexander A. Klimov
9b618c690a XTRIM data XREAD from icinga:runtime*
for Icinga 2 to monitor pending runtime updates.
2022-06-22 17:38:58 +02:00
Alexander A. Klimov
6627ecbfad parseString(): Don't Repeat Yourself 2022-06-22 15:27:00 +02:00
Alexander A. Klimov
2bda98cbe4 oneBulk(): terminate once input closed (like the regular bulker)
instead of outputting zero values.
2022-06-22 12:32:30 +02:00
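
The underlying Go behaviour: receiving from a closed channel yields zero values forever unless the second return value is checked. The sketch below is not the `oneBulk()` implementation; `forward` and `demoClosedInput` are hypothetical.

```go
package main

import "fmt"

// forward collects values from in and stops once in is closed.
func forward(in <-chan int) []int {
	var out []int

	for {
		v, ok := <-in
		if !ok {
			return out // input closed: terminate instead of emitting zero values
		}
		out = append(out, v)
	}
}

// demoClosedInput feeds two values, closes the channel and forwards it.
func demoClosedInput() (forwarded []int, zero int, open bool) {
	ch := make(chan int, 2)
	ch <- 1
	ch <- 2
	close(ch)

	forwarded = forward(ch)
	zero, open = <-ch // a further receive returns the zero value and false

	return forwarded, zero, open
}

func main() {
	forwarded, zero, open := demoClosedInput()
	fmt.Println(forwarded, zero, open) // [1 2] 0 false
}
```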
Alexander A. Klimov
fac9f5e4e5 Write ops/s by op and s to icingadb:telemetry:stats 2022-06-15 09:51:59 +02:00
Alexander A. Klimov
0e5d098be4 DB#CleanupOlderThan(): allow to get done work counted in real time 2022-06-15 09:51:59 +02:00