Syncing the overdue flag to the host_state and service_state tables was
not HA-aware and assumed that Redis held a consistent copy of all
services/hosts with overdue set. However, if a second Icinga DB
instance was active, this set might have become outdated.
This commit fixes this by making the overdue sync HA-aware, i.e. pausing
it if the current instance is inactive. Additionally, it fetches the set
of overdue services from MySQL and syncs it back to Redis when becoming
active.
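
The behavior described above can be sketched roughly as follows. This is a
simplified illustration, not the actual Icinga DB code: the overdueSyncer
type, its methods, and the loadFromDB and write callbacks are hypothetical
names, and an in-memory map stands in for the set kept in Redis.

```go
package main

import "sync"

// overdueSyncer is a hypothetical stand-in for the overdue sync.
type overdueSyncer struct {
	mu     sync.Mutex
	active bool
	cached map[string]struct{} // stands in for the overdue set kept in Redis
}

// SetActive records an HA state change. When the instance becomes active,
// the cached set is replaced by the authoritative one loaded from the
// database (loadFromDB is an assumed callback).
func (s *overdueSyncer) SetActive(active bool, loadFromDB func() []string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if active && !s.active {
		s.cached = make(map[string]struct{})
		for _, id := range loadFromDB() {
			s.cached[id] = struct{}{}
		}
	}
	s.active = active
}

// Sync writes the overdue changes since the last round and reports whether
// it ran. It does nothing while the instance is inactive.
func (s *overdueSyncer) Sync(nowOverdue []string, write func(id string, overdue bool)) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.active {
		return false // paused: another instance is responsible
	}
	current := make(map[string]struct{}, len(nowOverdue))
	for _, id := range nowOverdue {
		current[id] = struct{}{}
		if _, ok := s.cached[id]; !ok {
			write(id, true) // newly overdue
		}
	}
	for id := range s.cached {
		if _, ok := current[id]; !ok {
			write(id, false) // no longer overdue
		}
	}
	s.cached = current
	return true
}
```

Reloading the set on takeover means a stale copy left behind by an earlier
active phase is never used as the baseline for the next diff.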
Selecting any heartbeat from the last 15 seconds from Redis makes little
sense; just wait for new ones. If Icinga 2 just stopped, there is no
point in becoming active for just a few seconds. Otherwise, the next
heartbeat will arrive in a second anyway.
Example output:
time="2020-12-21T09:22:29+01:00" level=info msg="Taking over." UUID=bbc39673-7281-5064-a71e-91c25dbc6643 context=HA-Testing
time="2020-12-21T09:22:29+01:00" level=info msg="Taking over." UUID=dddedcc5-507e-56d6-adb4-9c6451f6aeb7 context=HA-Testing
time="2020-12-21T09:22:29+01:00" level=info msg="Changing HA state to active"
time="2020-12-21T09:22:29+01:00" level=info msg="Changing HA state to active"
ha_test.go:291:
Error Trace: ha_test.go:291
Error: Not equal:
expected: 1
actual : 2
Test: TestHA_ConcurrentCheckResponsibility
Messages: exactly 1 instance must be active after checkResponsibility() but 2 are active
--- FAIL: TestHA_ConcurrentCheckResponsibility (0.02s)
The HA code generated special notifications for the configsync package.
This commit instead provides a generic notification channel in the HA
package and moves all the special handling to the configsync package
which now generates its special notifications based on the generic ones.
If anything in a transaction fails, the transaction should be retried as
a whole, just retrying some individual command makes no sense and breaks
transaction semantics. For example, if a COMMIT fails in MySQL, it
automatically rolls back the transaction and if you issue another
COMMIT, you just commit an empty transaction.
Remove the test case for commit, as it basically tested that the code
showed the broken behavior described above.
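
A minimal sketch of retrying a transaction as a whole, assuming a helper
along these lines. RetryTx, runOnce and flakyTx are hypothetical names and
not taken from the Icinga DB code. The Tx interface only lists Commit and
Rollback, both of which *sql.Tx from database/sql provides.

```go
package main

import "errors"

// Tx is the subset of a transaction the sketch needs; *sql.Tx satisfies it.
type Tx interface {
	Commit() error
	Rollback() error
}

// RetryTx runs begin, fn and Commit as one unit. On any failure the whole
// unit is repeated, never a single statement or the COMMIT alone.
func RetryTx(attempts int, begin func() (Tx, error), fn func(Tx) error) error {
	err := errors.New("no attempts made")
	for i := 0; i < attempts; i++ {
		if err = runOnce(begin, fn); err == nil {
			return nil
		}
	}
	return err
}

// runOnce performs exactly one BEGIN ... COMMIT cycle.
func runOnce(begin func() (Tx, error), fn func(Tx) error) error {
	tx, err := begin()
	if err != nil {
		return err
	}
	if err := fn(tx); err != nil {
		_ = tx.Rollback() // best effort; the original error is reported
		return err
	}
	// If Commit fails, the caller starts over with a fresh transaction.
	return tx.Commit()
}

// flakyTx is a fake used only for illustration and testing.
type flakyTx struct{ failCommit bool }

func (t *flakyTx) Commit() error {
	if t.failCommit {
		return errors.New("commit failed")
	}
	return nil
}

func (t *flakyTx) Rollback() error { return nil }
```

With a real database, begin could wrap db.BeginTx and fn would issue the
statements through the returned transaction, so a failed COMMIT leads to
all statements being executed again in a new transaction.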
The previous commit replaced HKEYS with HSCAN, which, in addition to the
keys, also returns the values from Redis. With this commit, the checksum
is fetched instead of the config values, or, if no checksum is
available, the value is reused.
Redis documentation states that it is preferable to use *SCAN instead of
*KEYS (https://redis.io/commands/keys):
Warning: consider KEYS as a command that should only be used in
production environments with extreme care. It may ruin performance
when it is executed against large databases. This command is intended
for debugging and special operations, such as changing your keyspace
layout. Don't use KEYS in your regular application code. If you're
looking for a way to find keys in a subset of your keyspace, consider
using SCAN or sets.
While the documentation is for the KEYS command and not HKEYS, the same
warning should apply here too, as the hashes we are iterating over can
get quite large. Therefore use the HSCAN command that exists as an
alternative.
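
For illustration, iterating over a hash with HSCAN follows a cursor loop
like the one sketched below. The hscanFunc type and scanHash are
hypothetical names; hscanFunc abstracts whatever Redis client call issues
HSCAN and is assumed to return the flat field/value list together with the
next cursor, as the command itself does.

```go
package main

import "fmt"

// hscanFunc stands in for one HSCAN call of a Redis client (assumed shape).
// pairs alternates field, value, field, value, ...
type hscanFunc func(key string, cursor uint64, count int64) (pairs []string, next uint64, err error)

// scanHash walks the hash stored at key in batches and calls visit for
// every field/value pair. Iteration ends when Redis returns cursor 0.
// Note that HSCAN may return the same field more than once.
func scanHash(hscan hscanFunc, key string, count int64, visit func(field, value string)) error {
	var cursor uint64 // 0 starts a new iteration
	for {
		pairs, next, err := hscan(key, cursor, count)
		if err != nil {
			return err
		}
		if len(pairs)%2 != 0 {
			return fmt.Errorf("HSCAN %s: odd number of elements (%d)", key, len(pairs))
		}
		for i := 0; i < len(pairs); i += 2 {
			visit(pairs[i], pairs[i+1])
		}
		if next == 0 {
			return nil // full iteration completed
		}
		cursor = next
	}
}
```

Since each call only returns a bounded batch, the server is not blocked
for the duration of a full traversal, unlike with HKEYS on a large hash.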