Commit graph

25 commits

Author SHA1 Message Date
Eric Lippmann
7c068d4adf Use icinga-go-library 2024-05-24 09:56:28 +02:00
Eric Lippmann
c070615e64 Move Redis related code to redis 2024-05-22 11:51:22 +02:00
Eric Lippmann
77ccdfc303 Move type related utility functions from internal to types 2024-05-22 11:51:21 +02:00
Eric Lippmann
5029e328c8 Unify notation of n * time.Duration 2024-04-11 13:01:31 +02:00
Alvar Penning
779afd1da3 Enhance HA "Taking over", "Handing over" logging
The reason for a switch in the HA roles was not always directly clear.
This change now introduces additional debug logging, indicating the
reasoning for either taking over or handing over the HA responsibility.

First, some logic was moved from the SQL query selecting active Icinga
DB instances to Go code. This allowed distinguishing between no
available responsible instances and responsible instances with an
expired heartbeat.

As the HA's peer timeout is logically bound to the Redis timeout, it
now references this timeout plus an additional grace timeout. Doing
so eliminates a race between a handover and a "forceful" takeover.

As the old code signaled a takeover whenever no other instance was
active, it now additionally checks whether it is already the
active/responsible node. In this case, the takeover logic, which would
be interrupted at a later point as the node is already responsible, can
be skipped.

In addition to the new logging messages, both the takeover and handover
channels now transport a string to communicate the reason instead
of an empty struct{}. By doing so, both the "Taking over" and "Handing
over" log messages are enriched with the reason.

This also required a change in the suppressed logging handling of the
HA.realize method, which had its logging enabled through the shouldLog
parameter. Now, there are both recurring events, which might be
suppressed, and state-changing events, which should always be logged.
Therefore, and because the logTicker's functionality was not clear to me
at first glance, I renamed it to routineLogTicker.

While dealing with the code, some function signature documentation was
added, to ease both my own understanding and that of future readers.

Additionally, the error handling of the SQL query selecting active
Icinga DB instances was changed slightly to also handle wrapped
sql.ErrNoRows errors.

Closes #688.
2024-04-02 13:23:11 +02:00
Eric Lippmann
e31b101f4f Upgrade go-redis to v9
Co-Authored-By: Alvar Penning <alvar.penning@icinga.com>
2024-03-22 15:32:15 +01:00
Alexander A. Klimov
5a79a72ff5 Heartbeat#sendEvent(m): nil-check m before dereferencing it
as it can be nil.
2023-01-19 16:55:11 +01:00
Eric Lippmann
cd96f0de6f Block XREADs for a maximum of one second
I observed that blocking XREADs without timeouts (BLOCK 0) exceed the
internal Redis retries across multiple consecutive Redis restarts and
I/O timeouts, which eventually leads to fatal errors. @julianbrost
looked at this for clarification; here is his finding:

go-redis only considers a command successful when it returned something,
so a successfully started blocking XREAD consumes a retry attempt each
time the underlying Redis connection is terminated. If this happens
often before any element appears in the stream, this error is
propagated. (This also means that even with this PR, when restarting
Redis often enough so that a query never reaches the BLOCK 1sec, this
would still happen.)

https://github.com/Icinga/icingadb/pull/504#issuecomment-1164589244
2022-06-28 16:09:29 +02:00
Alexander A. Klimov
e1ff704aff Write own heartbeat into icingadb:telemetry:heartbeat
including version, current DB error and HA status quo.
2022-06-23 18:31:45 +02:00
Eric Lippmann
ccda48234e Use custom logger for accessing the interval for periodic logging 2021-11-05 17:57:22 +01:00
Eric Lippmann
8ce917d45a Remove waiting for heartbeat message
If a heartbeat is pending,
we log it every 60 seconds anyway.
2021-11-05 17:52:11 +01:00
Eric Lippmann
8a03745273 Speak of Icinga heartbeat not Icinga 2 heartbeat 2021-11-05 17:18:03 +01:00
Julian Brost
9b02b18f46 Use new environment ID
https://github.com/Icinga/icinga2/pull/9036 introduced a new environment ID for
Icinga DB that's written to the icinga:stats stream as field
"icingadb_environment". This commit updates the code to make use of this ID
instead of the one derived from the Icinga 2 Environment constant.
2021-11-03 15:47:38 +01:00
Julian Brost
217ab03e59 heartbeat: wrap messages with a timestamp
Track when a heartbeat was received to allow other components to check when it
will expire.
2021-10-04 16:58:35 +02:00
Julian Brost
8b2cb3acb8 heartbeat: use a single channel for all beat/loss events
Using Cond does not allow catching all events reliably, as one will only receive
events that occur after starting to listen. For heartbeat loss events it's
important to reliably catch them to not remain in an HA active state incorrectly.

fixes #360
2021-10-04 16:36:09 +02:00
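
The reasoning in the commit above can be sketched in Go. This is an illustration under assumptions, not the code from the commit: the beatOrLoss type, the collect helper and the buffer size are hypothetical. A value sent on a channel stays in the channel until a receiver takes it, so a receiver that starts late still sees every event in order, whereas a sync.Cond broadcast only wakes goroutines that are already waiting.

```go
package main

import "fmt"

// beatOrLoss is a hypothetical event type: lost == false means a heartbeat
// was received, lost == true means the heartbeat was lost.
type beatOrLoss struct {
	lost bool
}

// collect reads exactly n events from the channel, preserving their order.
func collect(events <-chan beatOrLoss, n int) []beatOrLoss {
	out := make([]beatOrLoss, 0, n)
	for i := 0; i < n; i++ {
		out = append(out, <-events)
	}
	return out
}

func main() {
	// The buffer size of 2 is an assumption, chosen so the sends below do
	// not block while no receiver is running yet.
	events := make(chan beatOrLoss, 2)

	// Both events are sent before anyone is listening.
	events <- beatOrLoss{lost: false}
	events <- beatOrLoss{lost: true}

	// The receiver starts afterwards and still observes beat, then loss.
	for _, e := range collect(events, 2) {
		fmt.Println("lost:", e.lost)
	}
}
```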
Julian Brost
17321cdfc3 Fix use of wrong log function on heartbeat loss
Has to use the Warnw function as it passes additional zap attributes.
2021-09-23 09:27:26 +02:00
Eric Lippmann
0b1610c69b Use cancelCtx() instead of just cancel() 2021-08-09 10:29:47 +02:00
Eric Lippmann
725e70f0b9 Pointer receivers, Cond usage, pass ctx and Godoc for Heartbeat
Heartbeat now uses pointer receivers for its methods because
some methods actually change the heartbeat values.
The context is no longer stored in the structure,
but passed to the controller loop.
The beat and the lost channels are replaced by Cond and
the last heartbeat is stored independently to not be affected by
a slow HA receiver. If the database connections are occupied by
the config, HA cannot update the instance and does not read from
the beat channel in time.
In addition, heartbeat errors are no longer swallowed,
but handled in HA.
2021-07-20 10:17:05 +02:00
Eric Lippmann
e12425d8dc Wrap errors 2021-06-21 12:13:24 +02:00
Alexander A. Klimov
35349262ce Use time.NewTicker(), not time.Tick() 2021-05-28 14:24:36 +02:00
Eric Lippmann
372f5cae7c Also log environment info 2021-05-25 16:25:04 +02:00
Noah Hilverling
44c734f72d Improve database and HA logging 2021-05-25 09:49:48 +02:00
Alexander A. Klimov
1026d4cabf Wrap Redis errors 2021-05-19 11:57:58 +02:00
Alexander A. Klimov
4dffbad76e Make channels more specific 2021-03-15 16:34:58 +01:00
Eric Lippmann
77267fa60c Introduce type icingaredis.Heartbeat 2021-03-04 00:49:23 +01:00