Something is definitely going wrong if a client tries to reconnect to
this endpoint while it still has an active connection to that client. So
we shouldn't hide this, but at least log it at info level. Apart from
that, I've added some additional information about the currently active
client, such as when the last message was sent and received.
The Redis ACL system was introduced with Redis 6.0. It introduced users
with precisely granular permissions. This change allows Icinga 2 to use
the Icinga DB feature against a Redis with an ACL user.
This was reflected in the documentation, next to the already
implemented, but undocumented Redis database.
Closes#9536.
Each configuration field of an IcingaDB Object was marked with
no_user_modify as modifications via the API would not result in an
actual change. While the Object would be updated, the internal Redis
connection would not be restarted, resulting in an unexpected behavior.
The missing db_index was added to the documentation.
The previous validation in set_verify_callback() could be bypassed, tricking
Icinga 2 into treating invalid certificates as valid. To fix this, the
validation checks were moved into the IsVerifyOK() function.
This is tracked as CVE-2024-49369, more details will be published at a later time.
This fixes an issue where recovery notifications get lost if they happen
outside of a notification time period.
Not all calls to `Checkable::NotificationReasonApplies()` need
`GetStateBeforeSuppression()` to be checked. In fact, for one caller,
`FireSuppressedNotifications()` in
`lib/notification/notificationcomponent.cpp`, the state before suppression may
not even be initialized properly, so that the default value of OK is used which
can lead to incorrect return values. Note the difference between suppressions
happening on the level of the `Checkable` object level and the `Notification`
object level. Only the first sets the state before suppression in the
`Checkable` object, but so far, also the latter used that value incorrectly.
This commit moves the check of `GetStateBeforeSuppression()` from
`Checkable::NotificationReasonApplies()` to the one place where it's actually
relevant: `Checkable::FireSuppressedNotifications()`. This made the existing
call to `NotificationReasonApplies()` unneccessary as it would always return
true: the `type` argument is computed based on the current check result, so
there's no need to check it against the current check result.
`m_IsNoOp` was introduced to avoid building up log messages that will later be
discarded, like debug messages if no debug logging is configured. However, it
looks like the template operator<< implemented in the header file was forgotten
when adding this feature, all other places writing into `m_Buffer` already have
an if guard like added by this commit.
error: no matching function for call to 'intrusive_ptr_release'
...
candidate function not viable: cannot convert argument of incomplete type 'icinga::Notification *' to 'Object *' for 1st argument
void intrusive_ptr_release(Object *object);
Especially ApiListener#ReplayLog() enqueued lots of messages into
JsonRpcConnection#{m_IoStrand,m_OutgoingMessagesQueue} (RAM) even if
the connection was shut(ting) down. Now #Disconnect() takes effect ASAP.
This allows the function to be used both with a double timestamp or a pointer
to a tm struct. With this, a similar implementation inside the tests can simply
use our regular function.
So far, the return value of strftime() was simply ignored and the output buffer
passed to the icinga::String constructor. However, there are error conditions
where strftime() returns 0 to signal an error, like if the buffer was too small
for the output. In that case, there's no guarantee on the buffer contents and
reading it can result in undefined behavior. Unfortunately, returning 0 can
also indicate success and strftime() doesn't set errno, so there's no reliable
way to distinguish both situations. Thus, the implementation now returns the
empty string in both cases.
I attempted to use std::put_time() at first as that allows for better error
handling, however, there were problems with the implementation on Windows (see
inline comment), so I put that plan on hold at left strftime() there for the
time being.
localtime() is not thread-safe as it returns a pointer to a shared tm struct.
Everywhere except on Windows, localtime_r() is used already which avoids the
problem by using a struct allocated by the caller for the output.
Windows actually has a similar function called localtime_s() which has the same
properties, just with a different name and order of arguments.
The previous implementation actually had undefined behavior when called with a
double that can't be represented as time_t. With boost::numeric_cast, there's a
convenient cast available that avoids this and throws an exceptions on
overflow.
It's undefined behavior ([0], where the implicit conversion rule comes into
play because the C-style cast uses static_cast [1] which in turn uses the
imlicit conversion as per rule 5 of [2]):
> A prvalue of floating-point type can be converted to a prvalue of any integer
> type. The fractional part is truncated, that is, the fractional part is
> discarded.
>
> * If the truncated value cannot fit into the destination type, the behavior
> is undefined (even when the destination type is unsigned, modulo arithmetic
> does not apply).
Note that on Linux amd64, the undefined behavior typically manifests itself in
the result being the minimal value of time_t which then results in localtime_r
failing with EOVERFLOW.
[0]: https://en.cppreference.com/w/cpp/language/implicit_conversion#Floating.E2.80.93integral_conversions
[1]: https://en.cppreference.com/w/cpp/language/explicit_cast
[2]: https://en.cppreference.com/w/cpp/language/static_cast
Currently, when processing a `CheckResult`, it will first trigger an
`OnNextCheckChanged` event, which is sent to all connected endpoints.
Then, when `Checkable::ProcessCheckResult()` returns, an `OnCheckResult`
event is fired, which is of course also sent to all connected endpoints.
Next, the other endpoints receive the `event::SetNextCheck` cluster
event followed by `event::CheckResult`and invoke
`checkable#SetNextCheck()` and `Checkable#CheckResult()` with the newly
received check. So they also try to recalculate the next check
themselves and invalidate the previously received next check timestamp
from the source endpoint. Since each endpoint randomly initialises its
own scheduling offset, the recalculated next check will always differ by
a split second/millisecond on each of them. As a consequence, two Icinga
DB HA instances will generate two different checksums for the same state
and causes the state histories to be fully resynchronised after a
takeover/Icinga 2 reload.
A day specification like "monday -1" refers to the last Monday of the month.
However, there was an off by one if the first day of the next month is the same
day of the week, i.e. a Monday in this example.
LegacyTimePeriod::FindNthWeekday() picks a day to start the search for the day
in question. When given a negative n to search for the n-th last day, it
wrongly used the first day of the following month as the start and counted it
as if it was within the current month. This resulted in a 1/7 chance that the
result was one week too late.
This is fixed by using the last day of the current month instead.
When setting up Icinga 2 agents, in most cases, the default global
zones are not needed, but have to be removed manually or automatically
whith tools outside of Icinga 2 from the configuration.
This seems like unnecessary work, since the node setup command
does everything else.
This commit introduces a new option for the node setup
command ("--no-default-global-zones") to exclude the default global zones.
Before ae693cb7e1 (#9577) we've repeatedly looped over all items in parallel like this:
while not types.done:
for t in types:
if not t.done and t.dependencies.done:
with parallel(all_items, CONCURRENCY) as some_items:
for i in some_items:
if i.type is t:
i.commit()
I.e. all items got distributed over CONCURRENCY threads, but not always equally. E.g. it was the hosts' turn, but only two threads got hosts and did all the work. The others didn't do actual work (due to the lack of hosts in their queue) which reduced the performance. c721c302cd (#6581) fixed it by shuffling all_items first. ae693cb7e1 (#9577) made the latter unnecessary by replacing the above algorithm with this:
while not types.done:
for t in types:
if not t.done and t.dependencies.done:
with parallel(all_items[t], CONCURRENCY) as some_items:
for i in some_items:
if i.type is t:
i.commit()
I.e. parallel() gets only items of type t, so all threads get e.g. hosts.
When a HTTP connection dies prematurely while the response is sent,
`http::async_write()` sets the error code to something like broken pipe for
example. When calling `async_flush()` afterwards, it sometimes happens that
this never returns. This results in a resource leak as the coroutine isn't
cleaned up. This commit makes the individual functions throw exceptions instead
of silently ignoring the errors, resulting in the function terminating early
and also resulting in an error being logged as well.
Given that the internal `config::Update` cluster events are using this
as well to create received runtime objects, we don't want to persist
first the conf file and the load and validate it with `CompileFile`.
Otherwise, we are forced to remove the newly created file whenever we
can't validate, commit or activate it. This also would also have the
downside that two cluster events for the same object arriving at the
same moment from two different endpoints would result in two different
threads simultaneously creating and loading the same config file -
whereby only one of the surpasses the validation, while the other is
facing an object `re-definition` error and tries to remove that config
file it mistakenly thinks it has created. As a consequence, an object
successfully created by the former is implicitly deleted by the latter
thread, causing the objects to mysteriously disappear.
Closing and re-opening that very same log file shouldn't reset the
counter, otherwise some log files may exceed the max limit per file as
their offset indicator is reset each time they are re-opened.
On incoming connection timeout we log the remote endpoint which isn't
available if it was already disconnected - an exception is thrown. Get it
as long as we're still connected not to lose it, nor to get an exception.
While analyzing a possible memory leak, we encountered several coroutine
exception messages, which unfortunately do not provide any information
about what exactly went wrong, as exception diagnostics were previously
only logged at the notice level.
Although there is locking involved here, it shoudln't take too long for
the thread to actually acquire it, since there aren't that many threads
dealing with endpoint clients concurrently. It's just wasting pointless
time trying to obtain a CPU slot.
The table sla_history_downtime requires a downtime_end.
The Go daemon takes the cancel_time if has_been_cancelled is 1.
So we must supply a cancel_time whereever has_been_cancelled is 1.
Otherwise the Go daemon can't process some entries.
This code was added in commit 548eb93 and never did anything useful.
Using X509_get_signature_nid() or its expanded version in the pre-1.1
branch is the correct way of retrieving the signature algorithm of a
certificate.
CLA: trivial
On shutdown or HA re-connect ConfigObject#SetAuthority(false) is called which
does ObjectLock(this) and ConfigObject#Pause(). GelfWriter#Pause(), with the
above ObjectLock, calls m_WorkQueue.Join(). But items inside that also doing
ObjectLock(this) cause a deadlock.
Non-ECC DHE ciphers in the `cipher_list` attribute of `ApiListener` (the
default value includes these) had no effect as no DH parameters were available
and therefore the server wouldn't offer these ciphers. OpenSSL provides
built-in DH parameters starting from version 1.1.0, however, these have to be
enables explicitly using the `SSL_CTX_set_dh_auto()` function. This commit does
so and thereby makes it possible to establish a connection to an Icinga 2
server using a DHE cipher.
and, in case of null, fall back to Checkable#check_command.timeout, just like
IcingaDB#SerializeState(). Otherwise the Go daemon crashes. It expects a number.
As MacroProcessor checked just for CustomVarObject base class, but
IcingaApplication provided the vars attribute by itself, it had to also
resolve CV macros by itself. That logic diverged from MacroProcessor so that
macros inside IcingaApplication CVs weren't resolved. Until now.
At the moment, the Icinga DB feature will use that value as-is and
serialize it to JSON, resulting in a crash in Icinga DB down the road
because it expects a boolean.
the last state change could be a long time ago. If it's longer than
the new downtime's duration, the downtime expires immediately.
trigger time + duration < now
The problem was that some PerfData labels contained several whitespace characters,
not just one, and therefore it was parsed incorrectly in `SplitPerfdata()`. I.e. the condition
in line 144 checks whether the first and last character is a normal quote, but since the
label can contain spaces at the beginning and at the end respectively, this caused the problems.
This PR fixes the problem by removing all occurring whitespace from the beginning and end,
before starting to parse the actual label.
This allows for a faster config load due to less locking required.
The change is slightly backwards-incompatible. Before, you could manipulate the
globals namespace at a later stage, but disallowing this feels reasonable for
the performance benefit alone (which especially shows on many-core machines).
Apart from that, it's doubtful if doing so is even useful at all as the DSL
provides no mechanism for you to synchronize your operations that may run in
parallel. The data structures itself are protected from race conditions, but
anything implemented on top of this may still be subject to race conditions.
And even if some user has a good reason for doing this, there's a feasible
workaround by creating your own namespace like globals.mutable and using that
instead.
If a connection hangs for too long in ApiListener#NewClientHandler(),
ApiListener#AddConnection()'s Timeout calls boost::asio::basic_socket#cancel()
on that connection to trigger an exception which unwinds
ApiListener#NewClientHandler(). Previously that unwind could trigger a Defer
which called boost::asio::ssl::stream#async_shutdown() which extended the hang.
Normally if for some reason an ack comment still exists on a checkable not
acked anymore, still clean it up. But while replaying log config objects
incl. ack comments come before check results and acks. I.e. 1) ack comment,
2) DOWN check result and 3) ack. Not 1) DOWN check result, 2) ack and 3) ack
comment. So the checkable is temporarily not acked, but already has the ack
comment. In this case the DOWN check result which is older than the ack
comment shall not clean up the latter.
Traditional behaviour was to regard all dependecies as cumulative (e.g., the parent considered unreachable if any one dependency is violated), commit ed58922389 made all dependencies regarded redundant (e.g., the parent considered unreachable only if all dependency are violated). This may lead to unrelated services (or even hosts vs. services) inadvertantly regarded to be redundant to each other.
Most importantly, applying the explicit "disable-host-service-checks" dependency described in the "Monitoring Basics" chapter will defeat all other dependencies.
This commit introduces a new "redundancy_group" attribute for dependencies.
Specifying a redundancy_group causes a dependency to be regarded as redundant only inside that redundancy group.
Dependencies lacking a redundancy_group attribute are regarded as essential for the parent.
This allows for both cumulative and redundant dependencies and even a combination (cumulation of redundancies, like SSH depeding on both LDAP and DNS to function, while operating redundant LDAP servers as well as redundant DNS resolvers).
This commit lacks changes to the tests.
master before #9627 (a0286e9c6):
<1> => namespace n { x = 42; x = 42 }
^^^^^^
Constant must not be modified.
<2> =>
HEAD of #9627 (24b57f0d3):
<1> => namespace n { x = 42; x = 42 }
null
<2> =>
if caused by dependency or check period.
Now as long as any of the above causes check skips
next check and next update will be up-to-date in Icinga DB,
so the checkable won't slide into false positive overdue.
This was accidentally broken by #9627 because during config sync, a config
validation happens that uses `--define System.ZonesStageVarDir=...` which fails
on the now frozen namespace.
This commit changes this to use `Internal.ZonesStageVarDir` instead. After all,
this is used for internal functionality, users should not directly interact
with this flag.
Additionally, it no longer freezes the `Internal` namespace which actually
allows using `Internal.ZonesStageVarDir` in the first place. This also fixes
`--define Internal.Debug*` which was also broken by said PR. Freezing of the
`Internal` namespace is not necessary for performance reasons as it's not
searched implicitly (for example when accessing `globals.x`) and should users
actually interact with it, they should know by that name that they are on their
own.
This commit moves the initialization of the globals.Types namespace to type.cpp
in order to keep a pointer to the Namespace object in Type::m_Namespace and
simplify Type::GetByName() using it.
The dynamic type check is moved into an assertion after freezing the namespace.
This makes freezing a namespace an irrevocable operation but in return allows
omitting further lock operations. This results in a performance improvement as
reading an atomic bool is faster than acquiring and releasing a shared lock.
ObjectLocks on namespaces remain untouched as these mostly affect write
operations which there should be none of after freezing (if there are some,
they will throw exceptions anyways).
This commit removes EmbeddedNamespaceValue and ConstEmbeddedNamespaceValue and
reduces NamespaceValue down to a simple struct without inheritance or member
functions. The code from these clases is inlined into the Namespace class. The
class hierarchy determining whether a value is const is moved to an attribute
of NamespaceValue.
This is done in preparation for changes to the locking in the Namespace class.
Currently, it relies on a recursive mutex. In the future, a shared mutex
(read/write lock) should be used instead, which cannot allow recursive locking
(without failing or risk deadlocking on lock upgrades). With this change, all
operations requiring a lock for one operation are within one function, no
recursive locking is not needed any more.
This commit adds a new initialization priority `FreezeNamespaces` that is run
last and moves all calls to `Namespace::Freeze()` there. This allows all other
initialization functions to still update namespaces without the use of the
`overrideFrozen` flag.
It also moves the initialization of `System.Platform*` and `System.Build*` to
an initialize function so that these can also be set without setting
`overrideFrozen`.
This is preparation for a following commit that will make the frozen flag in
namespaces finial, no longer allowing it to be overriden (freezing the
namespace will disable locking, so performing further updates would be unsafe).
Now that all values are in one place, there is no reason for this numbering
with gaps anymore. If you need to insert a new value in between, you can just
do so in the enum.
This reverses the sort order of the enum, thereby requiring a change to the
sort order of the std::priority_queue containing the elements.
Change the type of the priority values from int to a new enum. By replacing the
magic int values throughout the code base with named values, there is now a
single place where all priority values are defined and you get an overview over
the initialization order.
InitializeOnceHelper calls Loader::AddDeferredInitializer which takes a
std::function, so it's eventually converted to that anyways. This commit just
does this a bit earlier, and by saving the step of the intermediate C function
pointer, this would now also work for capturing lambdas (which there are none
of at the moment).
to restore the behavior before the previous commit. Otherwise we'd delete all
API object's child objects' files including applied child object rules in /etc.
In essence, namespace behaviors acted as hooks for update operations on
namespaces. Two behaviors were implemented:
- `NamespaceBehavior`: allows the update operation unless it acts on a value
that itself was explicitly marked as constant.
- `ConstNamespaceBehavior`: initially allows insert operations but marks the
individual values as const. Additionally provides a `Freeze()` member
function. After this was called, updates are rejected unless a special
`overrideFrozen` flag is set explicitly.
This marvel of object-oriented programming can be replaced with a simple bool.
This commit basically replaces `Namespace::m_Behavior` with
`Namespace::m_ConstValues` and inlines the behavior functions where they were
called. While doing so, the code was slightly simplified by assuming that
`m_ConstValues` is true if `m_Frozen` is true. This is similar to what the API
allowed in the old code as you could only freeze a `ConstNamespaceBehavior`.
However, this PR moves the `Freeze()` member function and the related
`m_Freeze` member variable to the `Namespace` class. So now the API allows any
namespace to be frozen. The new code also makes sense with the previously
mentioned simplification: a `Namespace` with `m_ConstValues = false` can be
modified without restrictions until `Freeze()` is called. When this is done, it
becomes read-only.
The changes outside of `namespace.*` just adapt the code to the slightly
changed API.
Copying an ObjectLock results in the underlying mutex being unlocked too often.
There's also no good reason for copying a scoped locking class (if at all, it
should be moved).
instead of computing from scratch if they're in the _api package.
For now this changes literally nothing as paths of existing objects still follow
the scheme of paths of new objects which didn't change. Now Icinga only doesn't expect
existing objects at particular paths. However, with the latter in v2.14+ (agent,
satellite) we can just change the path scheme of new objects in v2.16+ (master)
as we wish. The child nodes will just follow the new scheme of paths of new objects.
This prevents the `m_HasMatches` property from being altered simultaneously.
This might seem harmless (since this property can only be set to true by any calling thread),
however, from a technical (C++) point of view, this constitutes a data race.
1. The lookup of apply rules per source type now implies
no String(const char*) (no malloc()) and just pointer (uint64) comparisions
2. Apply rules are now also grouped by target type via a nested map, that obsoletes
checking the target type while iterating over all rules per source type
When committing an item with `ignore_on_error` flag set fails, the `Commit()` method only returns `nullptr`
and the current item is not being dropped from `m_Items`. `CommittNewItems()` also doesn't check the return
value of `Commit()` but just continues and tries to commit all items from `m_Items` in recursive call. Since
this corrupt item is never removed from `m_Items`, it ends up in an endless recursion till it finally crashes.
by caching the total minimum log severity of all loggers in a
"global variable" and whether a message's severity is large enough for any of
the loggers in a per-message no-op flag.
1. Don't set a custom handler for SIGCHLD (in the umbrella process)
as that handler doesn't actually handle SIGCHLD anymore
2. Don't reset the SIGCHLD handler (in the worker process)
as there's nothing to reset anymore due to the above change
3. Don't block SIGCHLD across fork(2) as its handler doesn't change anymore
due to the above changes
Before:
On SIGCHLD from the forked worker the umbrella process sets a failure flag.
StartUnixWorker() recognises that and does waitpid(), failure message, etc..
On OpenBSD we can't tell the signal source, so we always set the failure flag.
That's not how our IPC shall work, that breaks the IPC sooner or later.
After:
No SIGCHLD handling and no failure flag setting.
Instead StartUnixWorker()'s wait loop uses waitpid(x,y,WNOHANG)
to avoid false positives while watching the forked worker.
This ensures that `frame.Depth` is only decreased when preceding `frame.IncreaseStackDepth()` callee was successful.
This way, `frame.Depth` will have the same depth prior to and after evaluating a frame.
Before (time: vertical, stack: horizontal):
* Checkable::ExecuteCheck
* Checkable::UpdateNextCheck
* IcingaDB::NextCheckChangedHandler
* HSET icinga:host:state
* HSET icinga:checksum:host:state
* ZADD icinga:nextupdate:host
* RandomCheckTask::ScriptFunc
* Checkable::ProcessCheckResult
* Checkable::UpdateNextCheck
* IcingaDB::NextCheckChangedHandler
* HSET icinga:host:state
* HSET icinga:checksum:host:state
* ZADD icinga:nextupdate:host
* IcingaDB::NewCheckResultHandler
* HSET icinga:host:state
* HSET icinga:checksum:host:state
* ZADD icinga:nextupdate:host
* IcingaDB::StateChangeHandler
* XADD icinga:runtime:state
* IcingaDB::ForwardHistoryEntries
* XADD icinga:history:stream:state
After:
* Checkable::ExecuteCheck
* Checkable::UpdateNextCheck
* RandomCheckTask::ScriptFunc
* Checkable::ProcessCheckResult
* Checkable::UpdateNextCheck
* IcingaDB::NewCheckResultHandler
* HSET icinga:host:state
* HSET icinga:checksum:host:state
* ZADD icinga:nextupdate:host
* IcingaDB::StateChangeHandler
* XADD icinga:runtime:state
* IcingaDB::ForwardHistoryEntries
* XADD icinga:history:stream:state
The first state + nextupdate (for overdue) update comes from next_check being
set to now + interval immediately before doing the actual check (not to trigger
it twice). This update is not only not important for the end user, but even
inappropriate. The end user SHALL see next_check being e.g. in -4s, not 5m, as
the check is running at the moment.
The second one is just redundant as IcingaDB::NewCheckResultHandler (the third
one) is called anyway and will update state + nextupdate as well.
Case:
1. icinga2 api setup
2. icinga2 daemon -C -x debug
Before: Second commands crashes at exit.
After: No crash.
As the comment between the removed lines clearly says:
Our destructors haven't been built for static data.
This is build type independent.
The point of logging to the Windows Event Log was to catch errors that happen
before the full logging configuration has been loaded and enabled. Messages
like the number of loaded objects per type just cause noise in the log and
provide little benefit. Therefore raise the required log level at this stage.
Note that this commit removes the (never documented) ability to use the -x flag
to change the level. But doing so would require patching the command line of
the service in the registry anyways.
If some kind of query is not supposed to be processed at the moment, there is
little point in checking it. During a full dump, state updates are suppressed
(i.e. delayed), so when a dump takes very long, this would have resulted in a
false Redis backlog warning.
The check makes no attempt to explicitly connect to Redis, it uses the
connection of the IcingaDB feature, so this message better describes the state
in this situation.
IcingaDB::GetConnection() uses IcingaDB::m_Rcon which is only initialized in
IcingaDB::Start(), therefore add a nullptr check to the check command.
Additionally, as m_Rcon is potentially accessed concurrently, add a copy of the
value that is safe for concurrent use.
icingadb-web shows multiple lines from the check output collapsed into a single
line. The lines containing just minuses make this look cluttered and making
making it a heading provides little to no benefit. Even when rendering markdown
in the check output at some point, having the lists labeled using normal
paragraphs would look just fine.
- Add icinga2_ and icingadb_ prefixes to make clear which component is
responsible for the value.
- Rename heartbeat_lag to heartbeat_age, describes it better in my opinion and
sound a bit less like something that should be as close to zero as possible.
- Rename redis_dump/database_sync into full_dump/full_sync as this is how these
operations are refered to in log messages as well.
- Rename Redis backlog into Redis query backlog, makes it a bit clearer in my
opinion.
- Rename runtime_backlog into runtime_update_backlog, as the component in
Icinga DB is called that way and this naming is also exposed in log messages.
- Rename dump_config/state/history into config/state/history_dump, makes it
sound more natural.
They add no additional information compared to the *_1min values as it's always
the same value divided by 60 anyways. Adding the actual value from the last
second makes little sense for realistic values of check_interval.
Use the already existing format to pass performance data to Icinga 2 rather
than some new JSON structure. Has the additional benefit of doing more things
in Go than in C++.
IcingaDB may receive callbacks from Boost signals before being fully started.
This resulted in situations where m_EnvironmentId was used before it was
initialized properly. This is fixed by initializing it earlier (during the
config validation stage). However, at this stage, it should not yet write to
disk, therefore, persisting the environment ID to disk is delayed until later
in the startup process.
Initializing at this stage has an extra benefit: if there is an error for some
reason (possibly corrupt icingadb.env file), this now shows up as a nice error
during config validation.
Additionally, this replaces the use of std::call_once with std::mutex due to
bug in libstdc++ (see inline comment for reference).
Icinga 2 treats null (Empty) as if the corresponding attribute is not
specified. However, without this commit, it would serialize the value as "null"
(i.e. type string), so that it ends up in the database as this string instead
of NULL. This commit adds handling for ValueEmpty so that is serialized as JSON
null value and ends up in the database as NULL.
Apparently there was a reason for making the members of generated classes
atomic. However, this was only done for some types, others were still accessed
using non-atomic operations. For members of type T::Ptr (i.e. intrusive_ptr<T>),
this can result in a double free when multiple threads access the same variable
and at least one of them writes to the variable.
This commit makes use of std::atomic<T> for more T (it removes the additional
constraint sizeof(T) <= sizeof(void*)) and uses a type including a mutex for
load and store operations as a fallback.
The very same object is already serialized a few lines above, the result is
even stored in a variable, but that variable was not used before. Simply using
this variable results in a noticeable improvement of config validation times.
Checkable::FireSuppressedNotifications() compares the time of the current
checkable with the last recovery time of parents to avoid notification right
after a parent recovered and before the current checkable was checked.
This commit makes this check also include to host if the checkable is a
service. This makes the behavior consistent with the documentation that states
there is an implicit dependency on the host (which isn't realized as implicitly
generating a Dependency object unfortunately).
as they would be re-connected in Resume() (HA).
Before they were still connected during pause and connected X+1 times
after X split-brains (the same data was written X+1 times).
To prevent Icinga2 from being restarted while
one or more requests are still in progress and end up
as corrupted stages without status file and startup logs.
If comments get removed in unintended ways (i.e. not by expiring or by using
the remove-comment API action), the comment object misses information to create
a proper history event for Icinga DB. Therefor, discard these events.
On ack Icinga first adds a comment, then acks the checkable
so the ack event has the comment ID.
But due to the yet missing ack the comment was missing is_sticky.
That's corrected now.
Icinga DB doesn't expect comment history for ack comments.
Before:
1. Acked checkable recovers
2. Icinga clears ack comments w/o setting removal time
3. Icinga DB gets neither removal time, nor expire time
4. Icinga DB falls back to NULL and violates NOT NULL constraint
icinga::Array requires locking by the caller when iterating using Begin() and
End(). This is only checked in debug builds but there it makes this function
fail.
This commit changes the Checkable notification suppression logic (notifications
are currently suppressed on the Checkable if it is unreachable, in a downtime,
or acknowledged) to that after the suppression reason ends, a state
notification is sent if and only if the first hard state after is different
from the last hard state from before. If the checkable is in a soft state after
the suppression ends, the notification is further suppressed until a hard state
is reached.
To achieve this behavior, a new attribute state_before_suppression is added to
Checkable. This attribute is set to the last hard state the first time either a
PROBLEM or a RECOVERY notification is suppressed. Compared to from before,
neither of these two flags in the suppressed_notification will ever be cleared
while the supression is still ongoing but only after the suppression ended and
the current state is compared with the old state stored in
state_before_suppression.
Without this commit, children and parents of a checkable were rescheduled on a
state change while holding the lock for the current checkable. If both ends of
a dependency are checked at the same time and both change state, they could end
up in a deadlock waiting for each other.
This commit fixes this problem by changing the code so that other checkables
are rescheduled only after releasing the lock for the current checkable.
When there are multiple active IDO instances on the same node, before this
commit, all of them would share a single DbValue object for the notification_id
column of the icinga_contactnotifications table. This resulted in the issue
that one database references the notification_id in another database.
This commit fixes this by using a separate DbValue value for each IDO instance.
This needs a new signal as the existing OnQuery and OnMultipleQueries signals
perform the same queries on all IDO instances, but different queries are needed
here per instance (they only differ in the referenced DbValue). Therefore, a
new signal OnMakeQueries is added that takes a std::function which is called
once per IDO instance and can access callbacks to perform one or multiple
queries only on this specific IDO instance.
to ensure Checkable#IsReachable() returns correctly for dependency children inside OnReachabilityChanged().
That needs the dependency parent to be already in the correct state.
refs #9143
StateChangeHandler() is the function used when the actual hard/soft state
changes and thus also writes state history. This is not desired in this case,
instead, a runtime update should be generated, therefore call UpdateState()
instead.
refs #9063
StateChangeHandler() is the function used when the actual hard/soft state
changes and thus also writes state history. This is not desired in this case,
instead, a runtime update should be generated, therefore call UpdateState()
instead.
refs #9063
StateChangeHandler() is the function used when the actual hard/soft state
changes and thus also writes state history. This is not desired in this case,
instead, a runtime update should be generated, therefore call UpdateState()
instead.
refs #9063
Previously, both funktions did related operations but had unclear and confusing
naming:
- UpdateState updated the icinga:{host,service}:state Redis keys.
- SendStatusUpdate sent a runtime update for the icinga:{host,service}:state.
This commit merges both functions into one with a new mode parameter. The
following modes are now supported:
- Volatile: Update the icinga:{host,service}:state Redis key.
- Full: Perform the volatile state update and in addition send a corresponding
runtime update so that this state update gets written through to the
persistent database by a running icingadb process.
- RuntimeOnly: Special mode for callers that can ensure that a volatile update
for the current state was already performed but has to be upgraded to a full
update.
refs #9063
When a comment or downtime is removed manually, the name of the requestor and
timestamp have to be synced to other nodes in the cluster to allow all of them
to generate a consistent Icinga DB history stream.
refs #9101
i.e. the confusion of the state file deserializator with e.g. `"type":32` on startup.
That would unexpectedly restore (the now ignored) null (not `{"type":32}`) as there's no type "32".
refs #8186
The config parser requires *Command#arguments#order to be a Number, i.e. 42,
4.2 or even "4.2". That's int-casted where needed, now also for Icinga DB.
Before:
```
object CheckCommand "9117" {
command = [ "true" ]
arguments = {
"4.2" = { order = "4.2" }
}
}
```
2022-01-03T13:25:07.166+0100 FATAL icingadb json: cannot unmarshal string into Go value of type int64
m_DataBuffer may be modified concurrently while StatsFunc() is called, thus
it's unsafe to call size() on it. As write access to m_DataBuffer is already
synchronized by only modifying it from the single work queue thread, instead of
adding a mutex, this commit adds a new std::atomic_size_t which is additionally
updated when modifying m_DataBuffer and can safely be accessed in StatsFunc().
There is no explicit synchronization of access to m_DataBuffer which is fine if
it is only accessed from the single-threaded work queue. However, Stop() also
called Flush() in another thread, leading to concurrent write access to
m_DataBuffer which can result in a crash due to use after free/double free.
Changes in this commit:
* Flush() is renamed to FlushWQ() to show that it should only be called from
the work queue. Additionally, it now asserts that it is running on the work
queue.
* Visibility of some data members is changed from protected to private. No
other classes have to access these at the moment. By this change, accidental
concurrent access from derived classes in the future is prevented.
* Stop() now flushes by posting FlushWQ() to the work queue and joining it.
In the 2.12.6 release, the full schema file sets the version to 1.14.3, whereas
the latest available upgrade file 2.11.0.sql sets it to 1.15.0. Therefore, ship
a new upgrade file 2.12.7.sql for all users who imported their schema with
version 2.11.0 or later and never performed an upgrade since then. Their
databases incorrectly state schema version 1.14.3 and is bumped to the correct
version 1.15.0 by the upgrade.
In the 2.13.2 release, the full schema file sets the version to 1.15.0, whereas
the latest available upgrade file 2.13.0.sql sets it to 1.15.1. Therefore,
rename the incorrectly named upgrade file 2.13.1.sql (it was not shipped in
this or any other release so far) to 2.13.3.sql for users who imported their
schema with version 2.13.0 or later and never performed an upgrade since then.
Their databases incorrectly state schema version 1.15.0 and are bumped to the
correct version 1.15.1 by the upgrade.
The full schema is not touched by this commit as for the current branch, this
was already fixed by 815533b334.
When creating a fixed downtime that starts immediately while the checkable is
in a non-OK state, previously the code path for flexible downtimes was used to
trigger this downtime. This is fixed by this commit which resolves two issued:
1. Missing downtime start notification: notifications work differently for
fixed and flexible downtimes. This resulted in missing downtime start
notifications under the conditions described above.
2. Incorrect downtime trigger time: this code path would incorrectly assume the
timestamp of the last checkable as the trigger time which is incorrect for
fixed downtimes.