This change starts `attempt` at `1`, simplifying its use as a number for
reading in log messages and `if`s. Also, with `attempt` previously
starting at `0`, the second attempt would have been made immediately, as
our backoff implementation returns `0` in that case.
Co-Authored-By: Alvar Penning <alvar.penning@icinga.com>
Logging the `attempt` is a meaningless metric: it is not logged
continuously, but only when the retryable error changes, and it lacks
context since there is no such thing as a maximum number of attempts.
The retryable function may exit prematurely due to context errors that
shouldn't be retried. Before, we checked the returned error for context
errors, i.e. used `errors.Is()` to compare it to `Canceled` and
`DeadlineExceeded` which also yields `true` for errors that implement
`Is()` accordingly. For example, this applies to some non-exported Go
`net` errors. Now we explicitly check the context error instead.
All of our error callbacks are used to log the error and indicate that
we are retrying. Previously, in the case of context errors or
non-retryable errors, we would have called these too, which would have
resulted in misleading log messages.
Ensure that updating/inserting the instance row is also completed by the
current heartbeat's expiry time in takeover scenarios. Otherwise, there
is a risk that the takeover is performed with an already expired
heartbeat if the attempt takes longer than the heartbeat's expiry time.
Before, with `Timeout > 0`, we had to check errors for `nil` because we
were using a deadline context that might have been canceled before the
retry function completed with an error, but since we now pass `ctx`
as-is, this is no longer necessary.
Before, the error may have been handled in the `<-ctx.Done()` case in
the `select` for the backoff sleep, since `DeadlineExceeded` errors
return `true` for `Temporary()` and `Timeout()`.
This reverts commit a34aef4fc5.
Before, with `Timeout > 0`, we had to return `context canceled` and
`context deadline exceeded` errors from the causing context, but since
we now pass `ctx` as-is, this change is no longer necessary.
Before, with `Timeout > 0`, we ensured that retrying stops after
`Timeout` expires by passing a deadline context to `RetryableFunc`,
which aborts the function once `Timeout` lapsed, assuming that
`context.Done()` is actually taken into account, which applies to all of
our usages. I'm pretty sure we didn't think about functions that run
longer than `Timeout` and could therefore be canceled prematurely. Since
we now retry every database error with a timeout of 5 minutes, this
could happen with queries that wait for locks if a generous lock wait
timeout is configured in the database server. Now, `RetryableFunc` is
granted full execution time and is not canceled when `Timeout` is
exceeded. This means that `WithBackoff` may not stop exactly after
`Timeout` expires, or may not retry at all if the first execution of
`RetryableFunc` already takes longer than `Timeout`.
So far, we have maintained a list of error codes that should be retried.
This has by no means included all errors that can be retried, and errors
that can occur in a database cluster have not even been considered.
Instead of going through all possible error codes and verifying [1]
whether they should be included in the list of retryable errors,
**every** database error is now simply retried. Of course, this also
means retrying errors that cannot succeed on retry at all, but since we
now give up after 5 minutes, that's fine.
[1] It's hard to tell from a brief vendor error description whether the
error is actually retryable without context of when and how exactly such
errors are triggered. Also, there are database clusters that send their
own errors using vendor error codes.
A float isn't necessary, as Icinga 2's Checkable#max_check_attempts and
check_attempt are ints. But uint8 isn't enough for e.g. 1 check/s to get
HARD after 5m (300s > 255).
The reason for a switch in the HA roles was not always directly clear.
This change now introduces additional debug logging, indicating the
reasoning for either taking over or handing over the HA responsibility.
First, some logic was moved from the SQL query selecting active Icinga
DB instances to Go code. This allowed distinguishing between no
available responsible instances and responsible instances with an
expired heartbeat.
As the HA's peer timeout is logically bound to the Redis timeout, it now
references that timeout plus an additional grace period. Doing so
eliminates a race between a handover and a "forceful" takeover.
While the old code based a takeover on the fact that no other instance
is active, it now additionally checks whether this node is already the
active/responsible one. In that case, the takeover logic, which would be
interrupted at a later point anyway since the node is already
responsible, can be skipped.
In addition to the new logging messages, both the takeover and handover
channels now transport a string communicating the reason instead of an
empty struct{}. This enriches both the "Taking over" and "Handing over"
log messages with a reason.
This also required a change in the suppressed logging handling of the
HA.realize method, which got its logging enabled through the shouldLog
parameter. Now, there are both recurring events, which might be
suppressed, as well as state changing events, which should be logged.
Therefore, and because the logTicker's functionality was not clear to me
at first glance, I renamed it to routineLogTicker.
While dealing with the code, some function signature documentation was
added to ease understanding, both for me and for future readers.
Additionally, the error handling of the SQL query selecting active
Icinga DB instances was changed slightly to also handle wrapped
sql.ErrNoRows errors.
Closes #688.
In Galera clusters, wsrep_sync_wait=7 makes statements first wait for
all pending synchronization between nodes. This way, new child rows
await fresh parent rows from other nodes so as not to run into foreign
key errors. Single-node MySQL rejects this with error 1193 "Unknown
system variable", which is OK.
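A sketch of tolerating that rejection; `mysqlError` and `setGaleraSyncWait` are illustrative stand-ins (the real driver type is go-sql-driver/mysql's `MySQLError`):

```go
package main

import (
	"errors"
	"fmt"
)

// mysqlError stands in for the driver's error type carrying the vendor
// error number.
type mysqlError struct {
	Number uint16
	Msg    string
}

func (e *mysqlError) Error() string { return fmt.Sprintf("Error %d: %s", e.Number, e.Msg) }

// setGaleraSyncWait issues SET SESSION wsrep_sync_wait=7 via exec and
// treats error 1193 ("Unknown system variable") as success, since plain
// single-node MySQL doesn't know Galera variables.
func setGaleraSyncWait(exec func(query string) error) error {
	err := exec("SET SESSION wsrep_sync_wait=7")
	var me *mysqlError
	if errors.As(err, &me) && me.Number == 1193 {
		return nil // not a Galera node, that's OK
	}
	return err
}

func main() {
	err := setGaleraSyncWait(func(string) error {
		return &mysqlError{Number: 1193, Msg: "Unknown system variable"}
	})
	fmt.Println(err) // <nil>: 1193 is tolerated
}
```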
Looks like newer Go versions have a different opinion on how indentation
in comments should look. Adapt existing comments to make the GitHub
Actions happy.
This commit simplifies the `icingaDbOutputStage` type to contain only one
entity slice to be inserted/upserted. This allows simplifying the
handling in `migrateOneType()` by removing nested loops.
Additionally, a bit of code inside that function is outsourced into a new
`utils.ChanFromSlice()` function. This makes the body of the loop over the
insert/upsert operation (the loop using the `op` variable) simple enough
that it can just be unrolled, which saves the inline struct and slice
definition for that loop.
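A generic sketch of what such a helper can look like (the actual signature in `utils` may differ):

```go
package main

import "fmt"

// ChanFromSlice returns a channel that yields the slice's elements and
// is then closed, so ranging over it terminates once the slice is
// drained.
func ChanFromSlice[T any](s []T) <-chan T {
	ch := make(chan T, len(s))
	for _, v := range s {
		ch <- v
	}
	close(ch)
	return ch
}

func main() {
	for v := range ChanFromSlice([]string{"insert", "upsert"}) {
		fmt.Println(v)
	}
}
```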
The RDBMS rejects them by default. But it doesn't reject their equivalent:
Append "LOCK IN SHARE MODE" to every SELECT in a REPEATABLE READ transaction.
Now we do the latter with MySQL.
During connect(2) we may get ECONNREFUSED between the server's bind(2)
and listen(2), but for most of the downtime between boot and service
start, the socket won't exist yet. I.e., ENOENT is the de facto
ECONNREFUSED of *nix sockets.
so that connection attempts are also retried on RDBMS-specific errors,
e.g. Postgres' 57P03 (the database system is starting up), so as not to
crash.
On the other hand, SQL operations which are safe to retry on SQL errors
are also safe to retry on network errors.
Also, the order of the checks has been adjusted and the documentation
adapted accordingly. In addition, EAGAIN is no longer checked
explicitly, since this is already covered via Timeout().
I observed that blocking XREADs without timeouts (BLOCK 0) exceed the
go-redis internal retries on multiple consecutive Redis restarts and I/O
timeouts, eventually leading to fatal errors. @julianbrost looked at
this for clarification; here is his finding:
go-redis only considers a command successful when it returned something,
so a successfully started blocking XREAD consumes a retry attempt each
time the underlying Redis connection is terminated. If this happens
often before any element appears in the stream, this error is
propagated. (This also means that even with this PR, when restarting
Redis often enough so that a query never reaches the BLOCK 1sec, this
would still happen.)
https://github.com/Icinga/icingadb/pull/504#issuecomment-1164589244
Previously, we set the maximum number of retries to the pool size + 1,
but increased the pool size immediately after this assignment, so the
maximum number of retries was always too low for systems with fewer than
4 cores. Now it is done the other way around.
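The ordering bug, illustrated with made-up numbers (not the actual defaults):

```go
package main

import "fmt"

func main() {
	// Buggy order: MaxRetries derived from the pool size *before* it is
	// increased.
	poolSize := 2 // e.g. a machine with fewer than 4 cores
	maxRetries := poolSize + 1
	poolSize += 8
	fmt.Println(maxRetries, poolSize) // 3 10: too few retries for the pool

	// Fixed order: increase the pool size first, then derive MaxRetries.
	poolSize = 2
	poolSize += 8
	maxRetries = poolSize + 1
	fmt.Println(maxRetries, poolSize) // 11 10
}
```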
Rather than using a JSON structure to convey these values, simply use the
existing format to communicate performance data to Icinga 2.
Also removes the reference to Go in the Redis structure, allowing this string
to be extended with more metrics in the future without running into naming
issues.
instead of preserving the (never read) data and reading beyond its end later.
This indicates the correct number of pending runtime updates
(for monitoring by Icinga 2) from the beginning.