Without this commit, every time the NotificationTimerHandler runs it
will discard notifications that don't apply to the reason of the latest
check result. This is probably intended to clear outdated suppressed
notifications immediately when the TimePeriod resumes, but it also clears
out important ones (see the test case).
This commit fixes that by clearing out inapplicable notifications when
the timer runs the first time after the TimePeriod resumes. By that time
we can expect that no new suppressed notifications will be added and all
notifications that don't conflict with the last check-result can still be
run.
Fixes#10575
The fallback implementation was added for GCC 4.x as that didn't yet implement
std::is_trivially_copyable. However, by now we're using C++17 as our language
standard and that wasn't even implemented in GCC 4.x yet[^1]:
Some C++17 features are available since GCC 5, but support was experimental
and the ABI of C++17 features was not stable until GCC 9.
Hence, this became more or less dead code and can be removed.
[^1]: https://gcc.gnu.org/projects/cxx-status.html#cxx17
Due to this missing check, evaluating a DSL expression can result in a null
dereference, crashing the Icinga 2 process. Given that API users can also
provide DSL expression as filters, this can be triggered over the network as
well.
This issue was assigned CVE-2025-61908.
This is to avoid another kind of exploit found by where TicketSalt
can be accessed when the object filter is evaluated by checking
its name via the local `variable` reference and then `throw`ing
it to print it in the error message.
Reported-by: julian.brost@icinga.com
+ get_objects(): Has no use because in sandboxed contexts the result
can't be filtered or iterated over.
+ get_template(): Currently this is not dangerous because the returned
dictionary object does not hold any interesting
information. However, someone could add more details
in the future and forget to add a permission check.
+ get_templates(): Combines the reasons for get_objects() and
get_template()
+ get_env(): There is no point of ever using this in a filter expression.
It's annoying to have to wait 10 seconds for the `liveness_disconnect`
test to complete, so make the timeout configurable and set it to a way
lower value to test the functionality.
So far, calling Checkable::IsReachable() traversed all possible paths to it's
parents. In case a parent is reachable via multiple paths, all it's parents
were evaluated multiple times, result in a worst-case exponential complexity.
With this commit, the implementation keeps track of which checkables were
already visited and uses the already-computed reachability instead of repeating
the computation, ensuring a worst-case linear runtime within the graph size.
Checkable::IsReachable() and DependencyGroup::GetState() call each other
recursively. Moving them to a common helper class allows adding caching to them
in a later commit without having to pass a cache between the functions (through
a public interface) or resorting to thread_local variables.
In contrast to the regular `kill` binary, `icinga2 internal signal` drops
permissions before sending the signal. This is important as the PID file can be
written by the Icinga user, dropping the permissions prevents that user from
using this to send signals to processes it is not supposed to signal.
SIGUSR1 wasn't among the list of signals supported by `icinga2 internal
signal`, so it is added there.
This commit removes the existing m_IsNoOp bool and instead wraps the m_Buffer
std::ostringstream into std::optional. Functionally, this is pretty much the
same, with the exception that std::ostringstream is no longer constructed for
messages that will be discarded later.
There already is a template operator<< implemented, so far only for const
references though. Changing this to perfectly forward the argument to the
corresponding operator in the underlying std::ostringstring allows handling all
the cases there, removing the need for a separate overload for const char*.
Once the new worker process has read the config, it also includes a
`include */include.conf` statement within the config packages root
directory, and from there on we must not allow to delete any stage
directory from the config package. Otherwise, when the worker actually
evaluates that include statement, it will fail to find the directory
where the include file is located, or the `active.conf` file, which is
included from each stage's `include.conf` file, thus causing the worker
fail.
Co-Authored-By: Johannes Schmidt <johannes.schmidt@icinga.com>
Previously, we used a simple boolean to track the state of the package updates,
and didn't reset it back when the config validation was successful because it was
assumed that if we successfully validated the config beforehand, then the worker
would also successfully reload the config afterwards, and that the old worker would
be terminated. However, this assumption is not always true due to a number of reasons
that I can't even think of right now, but the most obvious one is that after we successfully
validated the config, the config might have changed again before the worker was able
to reload it. If that happens, then the new worker might fail to successfully validate
the config due to the recent changes, in which case the old worker would remain active,
and this flag would still be set to true, causing any subsequent requests to fail with a
`423` until you manually restart the Icinga 2 service.
So, in order to prevent such a situation, we are additionally tracking the last time a reload
failed and allow to bypass the `m_RunningPackageUpdates` flag only if the last reload failed
time was changed since the previous request.
This commit intruduces a small helper class that wraps any writer and
provides a flush operation that performs the corresponding action if the
writer is an AsyncJsonWriter and does nothing otherwise.
Replacing invalid UTF-8 characters beforehand by our selves doesn't make
any sense, the serializer can literally perform the same replacement ops
with the exact same Unicode replacement character (U+FFFD) on its own.
So, why not just use it directly? Instead of wasting memory on a temporary
`String` object to always UTF-8 validate every and each value, we just
use the serializer to directly to dump the replaced char (if any) into
the output writer. No memory waste, no fuss!
The Array, Dictionary, and Namespace types provide a Freeze() method that makes
them read-only. So far, there was the possibility to call some methods with
`overrideFrozen=true` which would then bypass the corresponding check and allow
modification of the data structures nonetheless.
With 24b57f0d3a, this possibility was already
removed from the Namespace type. However, for interface compatibility, it kept
the parameter and just ignores it, throwing an exception on any modification on
a frozen instance.
The only place using `overrideFrozen` was processing of the `-D`/`--define`
command line flag that allows setting additional variables in the DSL. At the
time it is evaluated, there are no user-created data structures yet that could
be frozen, so the only frozen objects that could be encountered are Namespaces
(Icinga doesn't freeze other types by itself) and for these, `overrideFrozen`
already has no effect.
Hence, there is no harm in removing `overrideFrozen` altogether. This
simplifies the code and also means that frozen objects are now indeed read-only
without exceptions, allowing further optimizations regarding locking in the
future.
The wait group gets passed to HttpServerConnection, then down to the
HttpHandlers. For those handlers that modify the program state, the
wait group is locked so ApiListener will wait on Stop() for the
request to complete. If the request iterates over config objects,
a further check on the state of the wait group is added to abort early
and not delay program shutdown. In that case, 503 responses will be
sent to the client.
Additionally, in HttpServerConnection, no further requests than the
one already started will be allowed once the wait group is joining.
CalcEventID's internal logic uses the TimestampToMilliseconds function
to convert the given eventTime to milliseconds. Within this function,
the timestamp is capped to prevent an overflow.
On three occasions, the input timestamp given to CalcEventID had already
been converted using TimestampToMilliseconds. The second
TimestampToMilliseconds function then checked the value and always
returned the capped maximum value. Consequently, CalcEventID returned
the same hash value for different timestamps.
This affected SendFlappingChange, SendAcknowledgementSet, and
SendAcknowledgementCleared. For example, when two acknowledgments were
created for the same service, the calculated event_id representing the
history table row would be identical.
Fixes#10465
Since the types and states attributes are user configurable and allowed to change at
runtime, we need to update the actual filter bitsets whenever these attributes change.
Otherwise, the filter bitsets would be stale and not reflect their current state.
No external user needs to manipulate the actual object dependency
graphs. This was maybe introduced for debugging purposes at that time
but if someone messes with this in prod - good luck with that. Oh, apart
from that it's broken :( and doesn't track parents as its implies but
children.
Not sure why it's introduced in the first place, maybe for debugging
purposes at the early stage of Icinga 2 dev but I failed to see an
actual useful use case for it that's worth its maintenance burden. So,
this commit dropped it entirely from the DSL language.
Boost `1.88.0` introduced a feature [^1] that makes use of the Windows API, but it
uses API functions that are only available with `PSAPI_VERSION >= 2` and
Windows VISTA only supports `PSAPI_VERSION == 1`. Actually, that new feature
can also be disabled by setting the `BOOST_STACKTRACE_DISABLE_OFFSET_ADDR_BASE`
macro, but since it seems to be a useful feature and isn't even disabled by default,
we can just drop it that ancient Windows version instead of disabling it.
[^1]: https://github.com/boostorg/stacktrace/pull/200
With this mutex, we can have deadlock in the following case:
1/ Thread A processes a /v1/actions/acknowledge-problem request and locks the checkable
2/ Thread B processes a /v1/actions/add-comment and enters first the ConfigItem::ActivateItems() method and locks the static mutex there and starts the just created comment object, which triggers the OnCommentAdded() event.
3/ Thread A wants to activate the just created ack comment as well but since the mutex is already locked by TB, it blocks.
4/ Thread B's OnCommentAdded() even dispatch causes the IcingaDB::CommentAddedHandler() to be called and implicitly triggers full state update for the checkable. Now, the state serialization of that checkable (remember that's the same checkable currently locked by TA) also includes computing its severity, thus it calls either service->GetSeverity() or host->GetSeverity(). However, since computing the checkable severity (as of now) requires acquiring the object lock, and boom - they deadlock each other.
Pin child objects of hosts (HOST!...) to the same endpoint as the host.
This reduces cross-object action latency withing the same host.
If all endpoints know this algorithm, we can use it.
Old versions of OpenSSL stored a valid flag in the certificate (see inline code
comment for details) that if already set, causes parts of the verification to
be skipped and return that the certificate is valid, even if it's not actually
signed by the CA in the trust store.
This issue was assigned CVE-2025-48057.
`X509_STORE_CTX_get_error(csc)` was called after `X509_STORE_CTX_free(csc)`.
This is fixed by automatically freeing variables at the end of the function
using `std::unique_ptr`.
Previously, the `command#timeout` which by default is `1m`, was used to reschedule
the just sent remote check. However, this results into a bunch of extra checks being
sent to the remote host, even though the first one is still running. That's because
if one want to override the default timeout of the command for a specific host/service,
one has to set the `checkable#check_timeout` attribute to the desired value. So, this
commit makes sure that the `checkable#check_timeout` attribute (if set) is used to
reschedule the remote check.
Instead of IoEngine::YieldCurrentCoroutine(yc) until m_Queues.FutureResponseActions.empty(), async-wait a CV which is updated along with m_Queues.FutureResponseActions.
They're just useless, since a `CheckResult` handler is never going to be
called without a check result and a checkable can't exist without a
checkcommand.
Previously, recovery and ACK notifications were not delivered to users
who weren't notified about the problem state while having a configured
`Problem` type filter. However, since the type filter can also be
configured on the `Notification` object level, this resulted to an
incorrect behaviour. This PR changes the existing logic so that the
recovery and ACK notifications gets dropped only if the `Problem` filter
is configured on both the `User` and `Notification` object levels.
Ensure that the counter unit of measurement, "c", is parsed correctly
for performance data values again.
A prior refactoring in 720a88c29a changed
the parsing logic, resulting in an incorrect behavior for counter units.
By passing the raw input into the l_CsUoMs map first, the "c" UoM is
removed. Moving the explicit counter check before passing the raw unit
into the map resolves this issue.
Fixes#9540.
Previously, if you enable flapping for a Checkable but the corresponding
`Notification` object does not have `FlappingStart` or `FlappingEnd`
types set, the `no_more_notifications` flag wasn't reset to false again.
This commit ensures that this flag is always reset on `Recovery` even
the type filter does not match including when we miss the `Recovery` due
to Flapping state.
Notification objects can refer User and Group objects similar to how they can
refer Host and Service objects, so that dependency feels quite natural. Note
that for evaluating most configuration, this order doesn't really matter, the
configuration will successfully evaluate in either case, the difference can be
noticed mainly in more advanced configurations, for example when dynamically
assigning user based on their groups. When accessing user objects from the
Notification object definition (like in the following example), without this
change, only groups configured directly in groups attribute of User objects are
visible and those added via assign clauses in UserGroup objects are missing.
With this commit, these are also visible.
apply Notification "n" to Host {
for (var u in get_objects(User)) {
log(u.name + " -> " + Json.encode(u.groups))
}
# [...]
}
There are inputs to mktime() where the behavior is not specified and there's
also no single obviously correct behavior. In particular, this affects how
auto-detection of whether DST is in effect is done when tm_isdst = -1 is set
and the time specified does not exist at all or exists twice on that day.
If different implementations are used within an Icinga 2 cluster, that can lead
to inconsistent behavior because different nodes may interpret the same
TimePeriod differently.
This commit introduces a wrapper to mktime(), namely Utility::NormalizeTm()
that implements the behavior provided by glibc. The choice for glibc's behavior
is pretty arbitrary, it was simply picked because most systems that are
officially/fully supported use it (with the only exception being Windows), so
this should give the least possible amount of user-visible changes.
As part of this commit, the closely related helper function mktime_const() is
also moved to Utility::TmToTimestamp() and made a wrapper around the newly
introduced NormalizeTm().
A second abort() is needed at the end of `SigAbrtHandler()` to trigger the SIG_DFL action (in this case the core dump).
Also since `AttachDebugger()` disables the ability to dump core, so
it gets reenabled after returning from it.
for (const T& needle : haystack) creates the illusion that haystack is a
container of T and we're just borrowing needle. In these cases that's not true.
f isn't used otherwise in the function, so if possible, it can just be moved into the lambda, avoiding a copy.
Co-authored-by: Alexander Aleksandrovič Klimov <alexander.klimov@icinga.com>
not just boost::coroutines::detail::forced_unwind.
This is needed because as of Boost 1.87, boost::asio::spawn() uses Fiber, not Coroutine v1.
https://github.com/boostorg/asio/commit/df973a85ed69f021
This is safe because every actual exception shall inherit from std::exception. Except forced_unwind and its Fiber equivalent, so that `catch(const std::exception&)` doesn't catch them and only them.
The timestamps used both in the CheckerComponent and Checkable debug
logs were printed in the scientific notation, making them effectively
useless.
> debug/CheckerComponent: Scheduling info for checkable 'host!service' (2025-02-26 14:53:16 +0100): Object 'host!service', Next Check: 2025-02-26 14:53:16 +0100(1.74058e+09).
> debug/Checkable: Update checkable 'host!service' with check interval '300' from last check time at 2025-02-26 14:48:47 +0100 (1.74058e+09) to next check time at 2025-02-26 14:58:12 +0100 (1.74058e+09).
Switching to std::fixed actually shows the complete Unix timestamp.
> debug/CheckerComponent: Scheduling info for checkable 'host!service' (2025-02-26 15:36:44 +0000): Object 'host!service', Next Check: 2025-02-26 15:36:44 +0000 (1740584204).
> debug/Checkable: Update checkable 'host!service' with check interval '60' from last check time at 2025-02-26 15:37:11 +0000 (1740584232) to next check time at 2025-02-26 15:38:09 +0000 (1740584290).
So far, Service::GetSeverity() only considered the state of its own host, i.e.
the implicit service to its own host dependency, and treated it similar to
acknowledgements and downtimes. In contrast, Host::GetSeverity() considered
reachability and treated it like a state, i.e. for the severity calculation,
the host was either up, down, or unreachable.
This commit changes the following things:
1. Make the service severity also consider explicitly configured dependencies
by using IsReachable().
2. Prefer acknowledgements and downtimes over unreachability in the severity
calculation so that if an already acknowledged or in-downtime services (i.e.
already handled service) becomes unreachable, it shouln't become more
severe.
3. To unify host and service severities a bit, hosts now use the same logic
that treats reachability more like acknowledgements/downtimes instead of
like a state (changing the other way around would the state from the check
plugin would not affect the severity for unrachable services anymore).
No change in functionality. The first two branches actually set the final
return value for the method, so they can just return directly, removing the
need to have the rest of the function inside an else block.
No change in functionality. The first two branches actually set the final
return value for the method, so they can just return directly, removing the
need to have the rest of the function inside an else block.
ApiListener#RelayMessageOne() relays every given message to the first connected endpoint Zone#GetEndpoints() returns. Randomness in combination with bad luck can direct more traffic (from a particular network segment) to one master than the admin wants.
This change lets the Zone#endpoints order prefer one endpoint over the other.
This prevents the use of DependencyGroup for storing the dependencies during
the early registration (m_DependencyGroupsPushedToRegistry = false),
m_PendingDependencies is introduced as a replacement to store the dependencies
at that time.
Previously the dependency state was evaluated by picking the first
dependency object from the batched members. However, since the
dependency `disable_{checks,notifications` attributes aren't taken into
account when batching the members, the evaluated state may yield a wrong
result for some Checkables due to some random dependency from other
Checkable of that group that has the `disable_{checks,notifications`
attrs set. This commit forces the callers to always provide the child
Checkable the state is evaluated for and picks only the dependency
objects of that child Checkable.
The new implementation just counts reachable and available parents and
determines the overall result by comparing numbers, see inline comments for
more information.
This also fixes an issue in the previous implementation: if it didn't return
early from the loop, it would just return the state of the last parent
considered which may not actually represent the group state accurately.