Ensure that updating/inserting of the instance row is also completed by
the current heartbeat's expiry time in takeover scenarios. Otherwise,
there is a risk that the takeover will be performed with an already
expired heartbeat if the attempt takes longer than the expiry time of
the heartbeat.
Before, with `Timeout >0`, we had to check errors for `nil` because we
were using a deadline context that might have cancelled before the retry
function completed with an error, but since we now pass `ctx` as-is,
this is no longer necessary.
Before, the error may have been handled in the `<-ctx.Done()` case in
the `select` for the backoff sleep, since `DeadlineExceeded` errors
return `true` for `Temporary()` and `Timeout()`.
This reverts commit a34aef4fc5.
Before, with `Timeout >0`, we had to return `context canceled` and
`context deadline exceeded` errors from the causing context, but since
we now pass `ctx` as-is, this change is no longer necessary.
Before, `if Timeout >0`, we ensured to stop retrying after `Timeout`
expires by passing a deadline context to `RetryFunc`, which aborts the
function once `Timeout` lapsed - assuming that `context.Done()` is
actually taken into account, which applies to all of our usages. I'm
pretty sure we didn't think about functions that run longer than
`Timeout` and therefore could be canceled prematurely. Since we are now
retrying every database error with a timeout of 5 minutes, this could
happen with queries that wait for locks having a generous lock wait
timeout configured in the database server. Now, `RetryableFunc` is
granted full execution time and will not be canceled if `Timeout` is
exceeded. This means that `WithBackoff` may not stop exactly after
`Timeout` expires, or may not retry at all if the first execution of
`RetryableFunc` already takes longer than `Timeout`.
So far, we have maintained a list of error codes that should be retried.
This has by no means included all errors that can be retried, and errors
that can occur in a database cluster have not even been considered.
Instead of going through all possible error codes and verifying [1]
whether they should be included in the list of retryable errors,
**every** database error is now simply retried. Of course, this also
means that errors are retried that cannot be retried at all, but since
we now give up after 5 minutes, that's fine.
[1] It's hard to tell from a brief vendor error description whether the
error is actually retryable without context of when and how exactly such
errors are triggered. Also, there are database clusters that send their
own errors using vendor error codes.
A float isn't necessary as in Icinga 2 Checkable#max_check_attempts and
check_attempt are ints. But uint8 isn't enough for e.g. 1 check/s to get
HARD after 5m (300s > 255).
As reported in #726, the default values for `database.options` are
overwritten by go-yaml. By commenting out the key of the
empty/unmodified YAML dictionary, this bug is mitigated. To make this
change consistent, the keys of all other unmodified dictionary blocks
have also been commented out.
Close#726.
The reason for a switch in the HA roles was not always directly clear.
This change now introduces additional debug logging, indicating the
reasoning for either taking over or handing over the HA responsibility.
First, some logic was moved from the SQL query selecting active Icinga
DB instances to Go code. This allowed distinguishing between no
available responsible instances and responsible instances with an
expired heartbeat.
As the HA's peer timeout is logically bound to the Redis timeout, it
will now reference this timeout with an additional grace timeout. Doing
so eliminates a race between a handing over and a "forceful" take over.
As the old code indicated a takeover on the fact that no other instance
is active, it will now additionally check if it is already being the
active/responsible node. In this case, the takeover logic - which will
be interrupted at a later point as the node is already responsible - can
be skipped.
Next to the additional logging messages, both the takeover and handover
channel are now transporting a string to communicate the reason instead
of an empty struct{}. By doing so, both the "Taking over" and "Handing
over" log messages are enriched with reason.
This also required a change in the suppressed logging handling of the
HA.realize method, which got its logging enabled through the shouldLog
parameter. Now, there are both recurring events, which might be
suppressed, as well as state changing events, which should be logged.
Therefore, and because the logTicker's functionality was not clear to me
on first glance, I renamed it to routineLogTicker.
While dealing with the code, some function signature documentation were
added, to ease both mine as well as the understanding of future readers.
Additionally, the error handling of the SQL query selecting active
Icinga DB instances was changed slightly to also handle wrapped
sql.ErrNoRows errors.
Closes#688.
In Galera clusters wsrep_sync_wait=7 lets statements catch up all
pending sync between nodes first. This way new child rows await fresh parent
ones from other nodes not to run into foreign key errors. MySQL single nodes
will reject this with error 1193 "Unknown system variable" which is OK.