Instead of calling IoEngine::YieldCurrentCoroutine(yc) in a loop until m_Queues.FutureResponseActions.empty(), async-wait on a CV which is updated along with m_Queues.FutureResponseActions.
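For illustration, a minimal sketch of one way to do this (hypothetical class and names, not the actual Icinga 2 helper): the usual Asio trick is a timer that never expires, where notifying means cancelling the pending async_wait, so a coroutine can await the CV without repeatedly yielding.

    #include <boost/asio/deadline_timer.hpp>
    #include <boost/asio/io_context.hpp>
    #include <boost/asio/spawn.hpp>
    #include <boost/date_time/posix_time/posix_time_types.hpp>
    #include <boost/system/error_code.hpp>

    class AsyncCondVar
    {
    public:
        explicit AsyncCondVar(boost::asio::io_context& io) : m_Timer(io)
        {
            // Never expires on its own; expiry is only ever forced via cancel().
            m_Timer.expires_at(boost::posix_time::pos_infin);
        }

        void Wait(boost::asio::yield_context yc)
        {
            // operation_aborted is the expected "signal", so swallow the error.
            boost::system::error_code ec;
            m_Timer.async_wait(yc[ec]);
        }

        void NotifyAll()
        {
            // Wakes up every coroutine currently blocked in Wait().
            m_Timer.cancel();
        }

    private:
        boost::asio::deadline_timer m_Timer;
    };

A waiter could then loop along the lines of `while (!m_Queues.FutureResponseActions.empty()) cv.Wait(yc);` while the producer calls NotifyAll() whenever it updates the queue.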
They're just useless, since a `CheckResult` handler is never going to be
called without a check result and a checkable can't exist without a
checkcommand.
Previously, recovery and ACK notifications were not delivered to users
who weren't notified about the problem state while having a configured
`Problem` type filter. However, since the type filter can also be
configured on the `Notification` object level, this resulted in incorrect
behaviour. This PR changes the existing logic so that recovery and ACK
notifications get dropped only if the `Problem` filter is configured on both
the `User` and `Notification` object levels.
Ensure that the counter unit of measurement, "c", is parsed correctly
for performance data values again.
A prior refactoring in 720a88c29a changed
the parsing logic, resulting in an incorrect behavior for counter units.
By passing the raw input into the l_CsUoMs map first, the "c" UoM is
removed. Moving the explicit counter check before passing the raw unit
into the map resolves this issue.
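Illustrative sketch only (hypothetical names and unit table, not the actual PerfdataValue parser) of the fixed ordering: handle the plain counter UoM "c" before consulting the generic unit map, which would otherwise normalize it away.

    #include <map>
    #include <string>

    struct ParsedUnit
    {
        bool Counter = false;
        std::string Base; // e.g. "bytes", "seconds"
    };

    ParsedUnit ParseUnit(const std::string& unit)
    {
        // Stand-in for the real unit table (l_CsUoMs in the actual code).
        static const std::map<std::string, std::string> units {
            { "b", "bytes" }, { "kb", "bytes" }, { "s", "seconds" }, { "%", "percent" }
        };

        ParsedUnit result;

        // The explicit counter check has to come first ...
        if (unit == "c") {
            result.Counter = true;
            return result;
        }

        // ... and only then is the raw unit looked up in the map.
        auto it = units.find(unit);
        if (it != units.end())
            result.Base = it->second;

        return result;
    }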
Fixes #9540.
Previously, if you enable flapping for a Checkable but the corresponding
`Notification` object does not have `FlappingStart` or `FlappingEnd`
types set, the `no_more_notifications` flag wasn't reset to false again.
This commit ensures that this flag is always reset on `Recovery`, even if the
type filter does not match, including when we miss the `Recovery` due to the
Flapping state.
Notification objects can refer to User and Group objects similar to how they can
refer to Host and Service objects, so that dependency feels quite natural. Note
that for evaluating most configurations, this order doesn't really matter; the
configuration will evaluate successfully in either case. The difference shows up
mainly in more advanced configurations, for example when dynamically assigning
users based on their groups. When accessing user objects from the Notification
object definition (like in the following example), without this change only
groups configured directly in the `groups` attribute of User objects are visible,
and those added via assign clauses in UserGroup objects are missing. With this
commit, these are also visible.
    apply Notification "n" to Host {
        for (var u in get_objects(User)) {
            log(u.name + " -> " + Json.encode(u.groups))
        }
        # [...]
    }
There are inputs to mktime() where the behavior is not specified and there's
also no single obviously correct behavior. In particular, this affects how
auto-detection of whether DST is in effect is done when tm_isdst = -1 is set
and the time specified does not exist at all or exists twice on that day.
If different implementations are used within an Icinga 2 cluster, that can lead
to inconsistent behavior because different nodes may interpret the same
TimePeriod differently.
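A minimal POSIX-only demo of the ambiguity (the timezone and date are just an example of a 02:00 → 03:00 spring-forward transition): 02:30 local time does not exist on that day, and with tm_isdst = -1 it is up to the libc how to normalize it.

    #include <ctime>
    #include <iostream>
    #include <stdlib.h>

    int main()
    {
        setenv("TZ", "Europe/Berlin", 1); // zone with a 02:00 -> 03:00 DST transition
        tzset();

        std::tm tm {};
        tm.tm_year = 2024 - 1900;
        tm.tm_mon = 2;    // March
        tm.tm_mday = 31;  // DST transition day in that zone
        tm.tm_hour = 2;
        tm.tm_min = 30;   // 02:30 local time does not exist on that day
        tm.tm_isdst = -1; // let the implementation guess whether DST is in effect

        std::time_t ts = std::mktime(&tm);

        // Both the timestamp and the normalized tm may differ between libc
        // implementations for such inputs.
        std::cout << ts << " -> " << std::asctime(&tm);
    }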
This commit introduces a wrapper for mktime(), namely Utility::NormalizeTm(),
that implements the behavior provided by glibc. The choice of glibc's behavior
is fairly arbitrary; it was simply picked because most systems that are
officially/fully supported use it (the only exception being Windows), so this
should result in the least possible amount of user-visible changes.
As part of this commit, the closely related helper function mktime_const() is
also moved to Utility::TmToTimestamp() and made a wrapper around the newly
introduced NormalizeTm().
A second abort() is needed at the end of `SigAbrtHandler()` to trigger the SIG_DFL action (in this case the core dump).
Also, since `AttachDebugger()` disables the ability to dump core, it gets
re-enabled after returning from it.
for (const T& needle : haystack) creates the illusion that haystack is a
container of T and we're just borrowing needle. In these cases that's not true.
`f` isn't used otherwise in the function, so if possible, it can just be moved into the lambda, avoiding a copy.
Co-authored-by: Alexander Aleksandrovič Klimov <alexander.klimov@icinga.com>
not just boost::coroutines::detail::forced_unwind.
This is needed because as of Boost 1.87, boost::asio::spawn() uses Fiber, not Coroutine v1.
https://github.com/boostorg/asio/commit/df973a85ed69f021
This is safe because every actual exception shall inherit from std::exception; the only exceptions that don't are forced_unwind and its Fiber equivalent, so `catch (const std::exception&)` doesn't catch them, and only them.
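A stripped-down illustration of the resulting pattern (not the actual Icinga 2 handler): catching std::exception instead of using a catch-all lets the unwind exceptions, which deliberately don't derive from std::exception, keep propagating so the coroutine's stack can be unwound.

    #include <exception>
    #include <iostream>
    #include <stdexcept>

    // Placeholder for the actual coroutine work, which may throw.
    static void DoWork()
    {
        throw std::runtime_error("something went wrong");
    }

    int main()
    {
        try {
            DoWork();
        } catch (const std::exception& ex) {
            // Only "real" errors are handled here; forced_unwind (and its
            // Fiber equivalent) would pass straight through this handler.
            std::cerr << "Exception: " << ex.what() << '\n';
        }
    }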
The timestamps used both in the CheckerComponent and Checkable debug
logs were printed in the scientific notation, making them effectively
useless.
> debug/CheckerComponent: Scheduling info for checkable 'host!service' (2025-02-26 14:53:16 +0100): Object 'host!service', Next Check: 2025-02-26 14:53:16 +0100(1.74058e+09).
> debug/Checkable: Update checkable 'host!service' with check interval '300' from last check time at 2025-02-26 14:48:47 +0100 (1.74058e+09) to next check time at 2025-02-26 14:58:12 +0100 (1.74058e+09).
Switching to std::fixed actually shows the complete Unix timestamp.
> debug/CheckerComponent: Scheduling info for checkable 'host!service' (2025-02-26 15:36:44 +0000): Object 'host!service', Next Check: 2025-02-26 15:36:44 +0000 (1740584204).
> debug/Checkable: Update checkable 'host!service' with check interval '60' from last check time at 2025-02-26 15:37:11 +0000 (1740584232) to next check time at 2025-02-26 15:38:09 +0000 (1740584290).
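A minimal illustration of the formatting difference (the precision used here is only for the demo):

    #include <iomanip>
    #include <iostream>

    int main()
    {
        double nextCheck = 1740584204.0; // Unix timestamp, stored as double

        // Default stream formatting (6 significant digits) switches to
        // scientific notation for values this large:
        std::cout << nextCheck << '\n';                                       // 1.74058e+09

        // std::fixed keeps the complete timestamp readable:
        std::cout << std::fixed << std::setprecision(0) << nextCheck << '\n'; // 1740584204
    }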
So far, Service::GetSeverity() only considered the state of its own host, i.e.
the implicit service to its own host dependency, and treated it similar to
acknowledgements and downtimes. In contrast, Host::GetSeverity() considered
reachability and treated it like a state, i.e. for the severity calculation,
the host was either up, down, or unreachable.
This commit changes the following things:
1. Make the service severity also consider explicitly configured dependencies
by using IsReachable().
2. Prefer acknowledgements and downtimes over unreachability in the severity
   calculation so that if an already acknowledged or in-downtime service (i.e.
   an already handled service) becomes unreachable, it doesn't become more
   severe.
3. To unify host and service severities a bit, hosts now use the same logic
   that treats reachability more like acknowledgements/downtimes instead of
   like a state (changing it the other way around would mean the state from
   the check plugin no longer affects the severity of unreachable services).
No change in functionality. The first two branches actually set the final
return value for the method, so they can just return directly, removing the
need to have the rest of the function inside an else block.
ApiListener#RelayMessageOne() relays every given message to the first connected endpoint Zone#GetEndpoints() returns. Randomness in combination with bad luck can direct more traffic (from a particular network segment) to one master than the admin wants.
This change makes the Zone#endpoints order significant, letting one endpoint be preferred over the other.
This prevents the use of DependencyGroup for storing the dependencies during
the early registration (m_DependencyGroupsPushedToRegistry = false);
m_PendingDependencies is introduced as a replacement to store the dependencies
at that time.
Previously the dependency state was evaluated by picking the first
dependency object from the batched members. However, since the
dependency `disable_{checks,notifications}` attributes aren't taken into
account when batching the members, the evaluated state may yield a wrong
result for some Checkables due to some random dependency of another
Checkable of that group that has the `disable_{checks,notifications}`
attrs set. This commit forces the callers to always provide the child
Checkable the state is evaluated for and picks only the dependency
objects of that child Checkable.
The new implementation just counts reachable and available parents and
determines the overall result by comparing numbers, see inline comments for
more information.
This also fixes an issue in the previous implementation: if it didn't return
early from the loop, it would just return the state of the last parent
considered which may not actually represent the group state accurately.
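A rough sketch of the counting idea with hypothetical types; the real logic, including the redundancy-group semantics, lives in dependency.cpp:

    #include <cstddef>
    #include <vector>

    struct ParentState
    {
        bool Reachable = false;
        bool Available = false; // reachable and not in a failure state
    };

    enum class GroupResult { Ok, Failed, Unreachable };

    GroupResult EvaluateGroup(const std::vector<ParentState>& parents, bool redundant)
    {
        std::size_t reachable = 0, available = 0;

        for (const auto& parent : parents) {
            if (parent.Reachable) reachable++;
            if (parent.Available) available++;
        }

        // For a redundancy group one healthy parent suffices, otherwise all
        // parents have to be healthy.
        std::size_t needed = redundant ? 1 : parents.size();

        if (available >= needed)
            return GroupResult::Ok;

        // Failures caused by unreachable parents are reported as unreachable,
        // everything else as failed.
        return reachable >= needed ? GroupResult::Failed : GroupResult::Unreachable;
    }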
* Icinga daemon leaves zombie processes on very busy system
On a very heavily loaded system the process group kill can
be delayed until after the regular TERM signal has caused
the process to exit. In this situation the waitpid call
is valid and reaps the zombie process that would otherwise
be left behind.
* Update AUTHORS file
Service downtimes scheduled via the `all_services` flag are already
removed automatically when removing their parent downtimes (introduced
with #8913). Now, this commit makes it possible to perform the same action
for all child downtimes, i.e. not only for those of service objects, but
for all child objects represented in the dependency tree.
`checkable` is already added to the set by the insert call above, so calling
emplace for the same checkable doesn't do anything useful and can be removed.
The previous implementation wasn't per se wrong, but it was way too
inefficient. With this commit, each and every Checkable is visited only once,
and we won't traverse the same Checkable's children multiple times
somewhere in the dependency chain.
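The underlying technique is the classic visited-set traversal; an illustrative sketch with a hypothetical Node type:

    #include <unordered_set>
    #include <vector>

    struct Node
    {
        std::vector<Node*> Children;
    };

    void Visit(Node* node, std::unordered_set<Node*>& visited)
    {
        // Each node is processed at most once, no matter how many parents
        // reference it somewhere in the dependency chain.
        if (!visited.insert(node).second)
            return;

        // ... process node ...

        for (Node* child : node->Children)
            Visit(child, visited);
    }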
The previous limit (32) doesn't seem to make sense, and appears to be some random number.
So, this limit is set to 256 to match the limit in IsReachable().
This commit removes a distinction in how dependency objects are checked for
cycles in the resulting graph depending on whether they are part of the
initially loaded configuration during process startup or as part of a runtime
update.
The DependencyCycleChecker helper class is extended with a mechanism that
allows additional dependencies to be considered during the cycle search. This
allows using it to check for cycles before actually registering the
dependencies with the checkables.
The aforementioned case-distinction for initial/runtime-update config is
removed by making use of the newly added BeforeOnAllConfigLoaded signal to
perform the cycle check at once for each batch of dependencies inside
ConfigItem::CommitNewItems() for both cases now. During the initial config
loading, there can be multiple batches of dependencies as objects from apply
rules are created separately, so parts of the dependency graph might be visited
multiple times now; however, that is limited to a minimum as only parts of the
graph that are reachable from the newly added dependencies are searched.
This commit groups a bunch of structs and static functions inside
dependency.cpp into a new DependencyCycleChecker helper class. In the process,
the implementation was changed a bit; the behavior should be unchanged except
for a more user-friendly error message in the exception.
Boost only implements it itself starting from version 1.74, but a specialization
of std::hash<> can be added trivially to allow the use of
std::unordered_set<boost::intrusive_ptr<T>> and
std::unordered_map<boost::intrusive_ptr<K>, V>.
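A trivial version of such a specialization could look like this (illustrative, not copied from the actual change); it hashes the raw pointer, analogous to std::hash<std::shared_ptr<T>>:

    #include <boost/intrusive_ptr.hpp>
    #include <cstddef>
    #include <functional>
    #include <unordered_set>

    namespace std {

    template<typename T>
    struct hash<boost::intrusive_ptr<T>>
    {
        size_t operator()(const boost::intrusive_ptr<T>& ptr) const noexcept
        {
            // intrusive_ptr equality compares the raw pointers, so hashing
            // the raw pointer keeps hash and equality consistent.
            return hash<T*>{}(ptr.get());
        }
    };

    }

    // Example ref-counted type, just for this demo.
    struct Foo
    {
        int Refs = 0;
    };

    inline void intrusive_ptr_add_ref(Foo* p) { ++p->Refs; }
    inline void intrusive_ptr_release(Foo* p) { if (--p->Refs == 0) delete p; }

    int main()
    {
        std::unordered_set<boost::intrusive_ptr<Foo>> seen;
        seen.emplace(new Foo());
        return seen.size() == 1 ? 0 : 1;
    }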
Being unable to use such types already came up a few times in the past, often
resulting in the use of raw pointers instead, which always involves an additional
"is this safe?"/"could the object go out of scope?" discussion. This commit
solves this for the future by simply allowing the use of intrusive_ptr
in unordered containers.
Allows hooking into the config loading process just before OnAllConfigLoaded()
is called on a bunch of individual config objects. This allows doing some
operations more efficiently at once for all objects.
Intended use: when adding a number of dependencies, it has to be checked
whether this introduces any cycles. This can be done more efficiently if all
dependencies are checked at once. So far, this is done with a case-distinction
for initially loaded files in DaemonUtility::LoadConfigFiles() and for
dependencies created by runtime updates in Dependency::OnAllConfigLoaded(). The
mechanism added by this commit allows unifying the handling of both cases (done
in a following commit).
The move `String(Value&&)` constructor tries to partially move `String`
values from a `Value` type. However, since there was no appropriate
`Value::Get<T>()` overload that binds to the requested move
operation, the compiler would actually not move the value but copy it
instead, as the only available implementation of `Value::Get<T>()`
returns a const reference (`const T&`). This commit adds a new overload
that returns a non-const reference and allows optionally moving the string
value of a Value type.
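A stripped-down illustration with a hypothetical Holder type (not the actual Value class) of why the non-const overload matters: without it, the attempted move below would bind to `const T&` and silently degrade into a copy.

    #include <iostream>
    #include <string>
    #include <utility>

    struct Holder
    {
        std::string Data;

        // With only this const overload, std::move() on the result still
        // yields a const reference, which can only be copied from.
        template<typename T>
        const T& Get() const { return Data; }

        // The added non-const overload makes actually moving the value out possible.
        template<typename T>
        T& Get() { return Data; }
    };

    int main()
    {
        Holder h{std::string(1000, 'x')};

        // Binds to the non-const overload, so this really moves the string:
        std::string s(std::move(h.Get<std::string>()));

        std::cout << s.size() << '\n'; // 1000
    }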
The Icinga DB code performs intensive operations on certain STL containers,
primarily on `std::vector<String>`. Specifically, it inserts 2-3 new elements
at the beginning of a vector containing thousands of elements. Without this commit,
all the existing elements would be unnecessarily copied just to accommodate the new
elements at the front. By making this change, the compiler is able to optimize STL
operations like `push_back`, `emplace_back`, and `insert`, enabling it to prefer the
move constructor over copy operations, provided it is guaranteed that no exceptions
will be thrown.
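The STL behaviour in question can be demonstrated in isolation (a generic illustration of the mechanism, not the Icinga 2 String class): std::vector only relocates existing elements with the move constructor if that constructor is noexcept, otherwise it copies them to preserve the strong exception guarantee.

    #include <iostream>
    #include <vector>

    struct Tracked
    {
        Tracked() = default;
        Tracked(const Tracked&) { std::cout << "copy\n"; }
        Tracked(Tracked&&) noexcept { std::cout << "move\n"; } // without noexcept: "copy"
    };

    int main()
    {
        std::vector<Tracked> v;
        v.reserve(1);
        v.emplace_back();
        v.emplace_back(); // reallocation relocates the first element: prints "move"
    }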
The Icinga DB daemon processes the data from the `IcingaApplication`
type only and Icinga DB Web also uses only those stats. However, before
this commit, Icinga DB published all kinds of useless stats to Redis
each second, like the number of (un)reachable hosts, services, and so
on, which is a waste of CPU and some other resources. This commit reduces
the published data drastically to only those simple stats coming from
the `IcingaApplication` type.
This was mistakenly introduced with PR #7686 due to too many open
connections (#7680). This was wrong in the sense that closing the
connection is simply out of place here and should have been handled
differently. After we revised the RPC connection disconnect procedure
with `v2.14.4`, it becomes clear why it is wrong, because the connection
is closed abruptly before the corresponding response (`result`) has
even been written. Now if you remove the disconnect here, shouldn't the
issue #7680 occur again, you ask? The answer is no, because we now also
have a maximum timeout of `10s` for anonymous connections, after which
they are automatically closed. Thanks to the introduction of this
timeout by @julianbrost in #8479, this `Disconnect()` call has become
superfluous.
Some faulty monitoring plugins may return "inf" or "-inf" as
values due to a failure to initialize or other errors.
This patch introduces a check on whether the parsed value is infinite
(or negative infinite) and rejects the data point if that is the case.
The reasoning here is: there is no possible way a value of "inf" is ever
a true measurement or even useful. Furthermore, when passed to the
performance data writers, it may be rejected by the backend and lead
to further complications.
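A minimal sketch of such a check (hypothetical helper, not the actual parser):

    #include <cmath>
    #include <cstdlib>
    #include <iostream>
    #include <string>

    bool ParsePerfdataValue(const std::string& raw, double& value)
    {
        char* end = nullptr;
        value = std::strtod(raw.c_str(), &end);

        if (end == raw.c_str())
            return false; // not a number at all

        if (std::isinf(value))
            return false; // "inf" / "-inf": discard the data point

        return true;
    }

    int main()
    {
        double v;
        std::cout << ParsePerfdataValue("42.5", v) << ' '
                  << ParsePerfdataValue("inf", v) << ' '
                  << ParsePerfdataValue("-inf", v) << '\n'; // 1 0 0
    }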
PR #7445 incorrectly assumed that a peer that had already disconnected
and never reconnected did so because the endpoint client was dropped after
a successful socket shutdown. However, the issue at that time was that
there was not a single timeout guard that could cancel the `async_shutdown`
call, potentially blocking indefinitely. Although removing the client from
cache early might have allowed the endpoint to reconnect, it did not
resolve the underlying problem. Now that we have a proper cancellation
timeout, we can wait until the currently used socket is fully closed
before dropping the client from our cache. When our socket termination
works reliably, the `ApiListener` reconnect timer should attempt to
reconnect this endpoint after the next tick. Additionally, we now have
logs both for before and after socket termination, which may help
identify if it is hanging somewhere in between.
It's not used. Also, the callback shall run completely at once. This ensures that it won't (continue to) run once another coroutine on the strand calls Timeout#Cancel().
The reason for introducing AsioTlsStream::GracefulDisconnect() was to handle
the TLS shutdown properly with a timeout, since it exchanges messages with the
peer and can therefore block. However, the implementation of this timeout
involves spawning coroutines which are redundant in some cases. This commit
adds comments to the remaining calls of
async_shutdown() stating why calling it is safe in these places.
Calling `AsioTlsStream::async_shutdown()` performs a TLS shutdown which
exchanges messages (that's why it takes a `yield_context`) and thus has the
potential to block the coroutine. Therefore, it should be protected with a
timeout. As `async_shutdown()` doesn't simply take a timeout, this has to be
implemented using a timer. So far, these timers are scattered throughout the
codebase, with some places missing them entirely. This commit adds helper
functions to properly shut down a TLS connection with a single function call.
The .ti files call `DependencyGraph::AddDependency(this, service.get())`. Obviously, `service.get()` is the parent and `this` (Downtime, Notification, ...) is the child. The DependencyGraph terminology should reflect this, so as not to confuse its future users.
By design, only one Icinga 2 instance should be responsible in the HA
context. If this promise is broken, the Icinga 2 IcingaDB check should
report it.
The code did not check for invalid data in icingadb:telemetry:heartbeat.
With this change, it will go CRITICAL with a descriptive message and
report the actual number of icingadb_responsible_instances in the
performance data.
Currently, for each `Disconnect()` call, we spawn a coroutine, but every
one of them is just useless, except the first one. However, since all
`Disconnect()` usages share the same asio strand and cannot interfere
with each other, spawning another coroutine within `Disconnect()` isn't
even necessary. When a coroutine calls `Disconnect()` now, it will
immediately initiate an async shutdown of the socket, potentially causing
the coroutine to yield and allowing the others to resume. Therefore, the
`m_ShuttingDown` flag is still required by the coroutines to be checked
regularly.
In fact, this is already done for the outer loop (for each bulk), just not yet for the inner one (for each message of a bulk). So once the remote signals EOF, don't try to process the remaining queue until write error (which can't be associated with a particular message anyway, due to buffering), but just let the peer go. Flush already half-written messages, though, if possible.
by not calling `std::atomic<T>::atomic(void)`.
After the latter the instance "does not contain a T object, and its only valid uses are destruction and initialization by std::atomic_init" which we don't call. So the only safe option is `std::atomic<T>::atomic(T)`.
https://en.cppreference.com/w/cpp/atomic/atomic/atomic
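A minimal illustration of the difference:

    #include <atomic>

    int main()
    {
        // Default construction: before C++20 the contained value is
        // uninitialized, and per cppreference its only valid uses are
        // destruction and initialization by std::atomic_init -- reading
        // it is undefined behaviour.
        std::atomic<bool> uninitialized;
        (void)uninitialized;

        // Value construction, i.e. std::atomic<T>::atomic(T): always safe.
        std::atomic<bool> shuttingDown(false);

        return shuttingDown.load() ? 1 : 0;
    }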
When the `Disconnect()` method is called, clients are not disconnected
immediately. Instead, a new coroutine is spawned using the same strand
as the other coroutines. This coroutine calls `async_shutdown` on the
TCP socket, which might be blocking. However, in order not to block
indefinitely, the `Timeout` class cancels all operations on the socket
after `10` seconds. Though, the timeout does not trigger the handler
immediately; it spawns another coroutine using the same strand
as in the `JsonRpcConnection` class. This can cause unexpected delays if
e.g. `HandleIncomingMessages` gets resumed before the coroutine from the
timeout class. Apart from that, the coroutine for writing messages uses
the same condition, making the two symmetrical.
Something is definitely going wrong if a client tries to reconnect to
this endpoint while it still has an active connection to that client. So
we shouldn't hide this, but at least log it at info level. Apart from
that, I've added some additional information about the currently active
client, such as when the last message was sent and received.
The Redis ACL system was introduced with Redis 6.0. It introduced users
with precisely granular permissions. This change allows Icinga 2 to use
the Icinga DB feature against a Redis with an ACL user.
This was reflected in the documentation, next to the already
implemented, but undocumented Redis database.
Closes #9536.
Each configuration field of an IcingaDB Object was marked with
no_user_modify as modifications via the API would not result in an
actual change. While the Object would be updated, the internal Redis
connection would not be restarted, resulting in unexpected behavior.
The missing db_index was added to the documentation.
The previous validation in set_verify_callback() could be bypassed, tricking
Icinga 2 into treating invalid certificates as valid. To fix this, the
validation checks were moved into the IsVerifyOK() function.
This is tracked as CVE-2024-49369, more details will be published at a later time.
This fixes an issue where recovery notifications get lost if they happen
outside of a notification time period.
Not all calls to `Checkable::NotificationReasonApplies()` need
`GetStateBeforeSuppression()` to be checked. In fact, for one caller,
`FireSuppressedNotifications()` in
`lib/notification/notificationcomponent.cpp`, the state before suppression may
not even be initialized properly, so that the default value of OK is used which
can lead to incorrect return values. Note the difference between suppressions
happening at the `Checkable` object level and at the `Notification` object
level. Only the first sets the state before suppression in the `Checkable`
object, but so far, the latter also used that value incorrectly.
This commit moves the check of `GetStateBeforeSuppression()` from
`Checkable::NotificationReasonApplies()` to the one place where it's actually
relevant: `Checkable::FireSuppressedNotifications()`. This made the existing
call to `NotificationReasonApplies()` unnecessary as it would always return
true: the `type` argument is computed based on the current check result, so
there's no need to check it against the current check result.
`m_IsNoOp` was introduced to avoid building up log messages that will later be
discarded, like debug messages if no debug logging is configured. However, it
looks like the template operator<< implemented in the header file was forgotten
when adding this feature; all other places writing into `m_Buffer` already have
an if guard like the one added by this commit.
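A reduced sketch of that guard (hypothetical Logger class, not the actual icinga::Log implementation):

    #include <iostream>
    #include <sstream>
    #include <utility>

    class Logger
    {
    public:
        explicit Logger(bool isNoOp) : m_IsNoOp(isNoOp) { }

        template<typename T>
        Logger& operator<<(T&& value)
        {
            // The guard: don't even format the value if the message is a no-op.
            if (!m_IsNoOp)
                m_Buffer << std::forward<T>(value);

            return *this;
        }

        ~Logger()
        {
            if (!m_IsNoOp)
                std::cerr << m_Buffer.str() << '\n';
        }

    private:
        bool m_IsNoOp;
        std::ostringstream m_Buffer;
    };

    int main()
    {
        Logger(false) << "visible message, value=" << 42;
        Logger(true) << "discarded without formatting, value=" << 42;
    }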
error: no matching function for call to 'intrusive_ptr_release'
...
candidate function not viable: cannot convert argument of incomplete type 'icinga::Notification *' to 'Object *' for 1st argument
void intrusive_ptr_release(Object *object);