In the current statistics counter implementation, the statistics are
backed by an array of counters, which are updated via atomic operations.
This leads to contention, especially on high core count
machines.
This commit introduces a new isc_statsmulti_t counter that keeps a
separate array per thread. These counters are then aggregated only when
statistics are queried, shifting work off the critical path.
These changes lead to a ~2% improvement in perflab.
Move all LMDB-based new zone database functions from server.c into
nzd.c to reduce the size of server.c and isolate the NZD/LMDB
interface. Rename load_nzf() to nzd_load_nzf() to match the nzd_
namespace.
Now that NZF write support is gone, remove the unused nzfwriter_t
typedef and nzfwriter parameter from delete_zoneconf(). Remove the
bool locked parameter and simplify the locking in do_modzone() and
rmzone() to unconditional lock/unlock pairs.
Drop the NZF (New Zone File) fallback for persisting runtime zone
configurations, making LMDB (NZD) the only storage backend. This
removes all #ifdef HAVE_LMDB conditionals, the meson 'lmdb' option,
and the NZF-related functions. LMDB is now a mandatory build
dependency.
The named-nzd2nzf tool is now always built.
If we are modifiying the zone, the zone must have been added before.
Don't overwrite this value on modifications.
Also it feels cleaner to pass added=false to configure_zone() in
do_modzone().
Some code paths try to lock an already locked view->newzone.lock.
For example, do_modzone() aqcuires the lock and then calls
delete_zoneconf(), that wants to acquire the same lock.
Add a parameter to delete_zoneconf() that informs the function if the
lock has already been acquired.
A few port validation checks use >= UINT16_MAX instead of > UINT16_MAX,
incorrectly rejecting port 65535 as out of range. Port 65535 is a valid
TCP/UDP port number. Other port checks in the same file already use the
correct > comparison.
The asynchronous SIG(0) handling improperly used srcaddr, and dstaddr
from the caller's stack and didn't attach to aclenv. This could
possibly lead to ACL bypass as an invalid srcaddr could be matched or
possible assertion failure if the ACL environment would change between
the initial call and the SIG(0) processing due to the server
reconfiguration. This has been fixed.
Make the maximum number of processed delegation nameservers configurable
via the new 'max-delegation-servers' option (default: 13), replacing the
hardcoded NS_PROCESSING_LIMIT (20).
The default is reduced to 13 to precisely match the maximum number of
root servers that can fit into a classic 512-byte UDP payload. This
provides a natural, historically sound cap that mitigates resource
exhaustion and amplification attacks from artificially inflated or
misconfigured delegations.
The configuration option is strictly bounded between 1 and 100 to ensure
resolver stability.
The granularity of the simple histogram with fixed number of ranges
sometimes isn't good enough. As there's a need to implement a new
histogram statistics for the incoming query times (RTT), it was decided
to also update the existing RTT statistics of the outgoing queries
so that they look similar and use common code.
Remove the old histogram code from the resolver and from the statistics
channel. Reimplement the outgoing queries RTT histogram using the
isc_histomulti module, and prepare the necessary base for implementing
the incoming queries RTT histogram. The statistics channel will be
updated to expose the new histograms in an upcoming commit.
For Linux >= 6.8:
Since 2023, Linux has introduced a change to the IP_LOCAL_PORT_RANGE
socket option that eliminates the need for the random window
shifting (implemented as a fallback in the next commit).
By setting IP_LOCAL_PORT_RANGE option, we tell the kernel to use better
approach to the source port selection.
For Linux << 6.8:
This implement selecting port by random shifting range leveraging the
IP_LOCAL_PORT_RANGE socket option. The network manager is initialized
with the ephemeral port range (on startup and on reconfig) and then for
every outgoing TCP connection, we define a custom port range (1000
ports) and then randomly shift the custom range within the system range.
This helps the kernel to reduce the search space to the custom window
between <random_offset, random_offset + 1000>.
Reference:
https://blog.cloudflare.com/linux-transport-protocol-port-selection-performance/#kernel
`dns_view_findzonecut()` is used only in the context where the closest
name servers for a name need to be queried. In the future, this API
will also return the glues (if known) for those name servers, as well
as (exclusively, if both NS and DELEG exist) the DELEG record.
To avoid ambiguities with other code flows using `dns_db_findzonecut()`,
`dns_view_findzonecut()` has been renamed into `dns_view_bestzonecut()`.
Since the `sigrdataset` "output" parameter of `dns_view_findzonecut()`
is never used (always called with NULL), it is now removed.
Also, since the resolver is moving towards a parent-centric direction,
there is no point having a signature for the NS record (which is not
authoritative in the parent, so never signed) in the contextes where
`dns_view_findzonecut()` is called.
As `dns_view_findzonecut()` only returns either ISC_R_SUCCESS or
DNS_R_NXDOMAIN, and since it automatically disassociates the rdatasets
in case of failure, some call sites are simplified.
Currently we add an rrset-order cyclic statement to the default config.
Since the rrset-order allows matching a subset of all names, it must
be implemented with a string comparison against a wildcard, and since
the statement applies per rrset, this can result in millions of
comparisons per second on a busy authoritative server.
This commit removes rrset-order from the default config, but adds back
a code shim in query_setorder to preserve the previous behaviour.
Now that the configuration options `edns-version`, `edns-udp-size`,
`max-udp-size`, `no-cookie-udp-size` and `padding` have strict boundaries
(configuration failing if they are not respected), remove configuration
loading code which implicitely raises or lowers them.
A catalog zone is updated in an offloaded thread, which is not
stopped during a reconfiguration in an exclusive mode, and so
can cause a race condition with it.
Waiting for the offloaded threads to complete their work before
entering into the exclusive mode can potentially cause unwanted
delays, because offloaded threads are generally "allowed" to take
a longer amount of time before they complete.
Add a dns_catz_zone_prereconfig()/dns_catz_zone_postreconfig() pair
of functions which currently just lock the catalog zone when
reconfiguring it. The change should eliminate the race.
As a side note, there was already a similar pair of functions,
dns_catz_prereconfig() and dns_catz_postreconfig() which are called
before and after reconfiguring a 'dns_catz_zones_t' object.
Below are the stack traces of the reconfiguration thread which has
asserted, and a catalog zone update thread which was caught in the
middle of its work despite the fact that the exclusive mode is
turned on.
Stack trace of thread 23859:
#0 0x00007f80e7b8e52f raise (libc.so.6)
#1 0x00007f80e7b61e65 abort (libc.so.6)
#2 0x0000000000422558 assertion_failed (named)
#3 0x00007f80eaa6799e isc_assertion_failed (libisc-9.18.41.so)
#4 0x00007f80ea5bc788 dns_catz_entry_getname (libdns-9.18.41.so)
#5 0x000000000042ce0e catz_reconfigure (named)
#6 0x000000000042d3c5 configure_catz_zone (named)
#7 0x000000000042d7a4 configure_catz (named)
#8 0x0000000000430645 configure_view (named)
#9 0x000000000043d998 load_configuration (named)
#10 0x000000000044184f loadconfig (named)
#11 0x0000000000442525 named_server_reconfigcommand (named)
#12 0x000000000041b277 named_control_docommand (named)
#13 0x000000000041c74a control_command (named)
#14 0x00007f80eaa912ae task_run (libisc-9.18.41.so)
#15 0x00007f80eaa914cd isc_task_run (libisc-9.18.41.so)
#16 0x00007f80eaa46435 isc__nm_async_task (libisc-9.18.41.so)
#17 0x00007f80eaa467aa process_netievent (libisc-9.18.41.so)
#18 0x00007f80eaa475a6 process_queue (libisc-9.18.41.so)
#19 0x00007f80eaa46227 process_all_queues (libisc-9.18.41.so)
#20 0x00007f80eaa462a1 async_cb (libisc-9.18.41.so)
#21 0x00007f80e8d01893 uv__async_io.part.3 (libuv.so.1)
#22 0x00007f80e8d13ac4 uv__io_poll (libuv.so.1)
#23 0x00007f80e8d023fb uv_run (libuv.so.1)
#24 0x00007f80eaa45ced nm_thread (libisc-9.18.41.so)
#25 0x00007f80eaa9bda3 isc__trampoline_run (libisc-9.18.41.so)
#26 0x00007f80e7f1e1ca start_thread (libpthread.so.0)
#27 0x00007f80e7b798d3 __clone (libc.so.6)
...
...
Stack trace of thread 23912:
#0 0x00007f80ea5bc2da dns_catz_options_setdefault (libdns-9.18.41.so)
#1 0x00007f80ea5bd411 dns__catz_zones_merge (libdns-9.18.41.so)
#2 0x00007f80ea5c3c2f dns__catz_update_cb (libdns-9.18.41.so)
#3 0x00007f80eaa4fee9 isc__nm_work_run (libisc-9.18.41.so)
#4 0x00007f80eaa9bda3 isc__trampoline_run (libisc-9.18.41.so)
#5 0x00007f80eaa4ff48 isc__nm_work_cb (libisc-9.18.41.so)
#6 0x00007f80e8cfc75e worker (libuv.so.1)
#7 0x00007f80e7f1e1ca start_thread (libpthread.so.0)
#8 0x00007f80e7b798d3 __clone (libc.so.6)
Make all non-scalar properties of `cfg_obj_t` allocated values, which
ensures the union size is the width of one pointer. Also reorder the
fields inside `cfg_obj_t` to avoid alignment padding that would increase
the size. As a result, a `cfg_obj_t` instance is now 48 bytes on a
64-bit platform.
Add a static assertion to avoid increasing the size of the struct by
mistake.
The function `parse_sockaddrsub` was taking advantage of the fact that
both sockaddr and sockaddrtls were in the same position, and used to
initialize the sockaddr field independently if this was a -tls one or
not. This doesn't work anymore now that all fields are allocated,
so it has been slightly rewritten to take both cases into account
separately.
CLEANUP is a macro similar to CHECK but unconditional, jumping
to cleanup even if the result is ISC_R_SUCCESS. It is now used
in place of DST_RET, CLEANUP_WITH, and CHECK(<non-success constant>).
previously, there were over 40 separate definitions of CHECK macros, of
which most used "goto cleanup", and the rest "goto failure" or "goto
out". there were another 10 definitions of RETERR, of which most were
identical to CHECK, but some simply returned a result code instead of
jumping to a cleanup label.
this has now been standardized throughout the code base: RETERR is for
returning an error code in the case of an error, and CHECK is for jumping
to a cleanup tag, which is now always called "cleanup". both macros are
defined in isc/util.h.
In commit aea251f3bc, `isc_buffer_reserve()` was changed to
take a simple `isc_buffer_t *` instead of `isc_buffer_t **`.
A number of functions calling it have now been similarly
modified.
Wrap 'dns_keymgr_status()' in 'dns_zone_dnssecstatus()' so we can easily
retrieve the zone string name and refresh key time value.
In addition to the current time, output when the next key event is
expected.
Don't log keys that are completely hidden unless verbose is set.
Don't log key state values unless verbose is set, or they are in a
weird state.
For expected key states, log a more useful message of the stage of
the rollover. If we are in the middle of a key rollover, don't log
when the next key rollover is scheduled.
Condense the output for better readability.
Maintain the relationship between the parent and child fetch and when
creating a new child fetch, properly check the resolution loops that
would lead to a new fetch would join one of the parent's fetch contexts.
The "no root hints for view X" message must not be shown for the default
_bind/CH view. However, it is shown since 27c4f68dcc (part of effective
configuration changes).
The reason is that since 27c4f68dcc, `configure_views()` now processes
a single list of views, which contains both builtin and user views as
they are both part of the effective configuration. Those changes omitted
the `need_hints` bool that disabled the warning for the builtin view.
This commit silences the log message again.
when merging view objects into the effective configuration, add
allow-query-cache, allow-recursion, allow-query-cache-on and
allow-recursion-on ACLs as needed to reflect the way those
options inherit from each other.
this means the effective configuration is now correct for each
view. ACLs no longer need to be corrected when applying the
configuration, and the actual effective ACL values will be
displayed in "rndc showconf" and "named-checkconf -pe".
the merging of options and defaults into the effective configuration
broke the mutual inheritance of the allow-recursion, allow-query, and
allow-query-cache ACLs, and of the allow-recursion-on and
allow-query-cache-on ACLs.
this has been corrected by adding a 'cloned' flag to the cfg_obj
structure to indicate whether it was configured explicitly or
cloned from the defaults during parsing. we can then adjust the
ACLs while configuring a view, favoring user-configured values
when they're available over cloned defaults.
currently the adjustments to the ACLs are done in configure_view();
later they'll be moved into the effective configuration and this
special handling can be removed.
Because the asynchronous loading logic expected all jobs to be scheduled
then to be run (because it used to be scheduled during the exclusive
mode) and because all jobs are scheduled on various threads, there were
random situations where load_zones() would return after the scheduled
DB zone loading actually ran. In such cases, the zl->refs ref counter
in view_loaded() wouldn't go down to 0 and the remaining task to do
once all zones were loaded was never called. In particular,
server->reload_status kept the NAMED_RELOAD_PENDING state.
This problem is fixed by handling zoneload_t as a ref-counted object,
shared between load_zones() and each instance of scheduled zone DB
loading. Its destructor function is actually the content of
view_loaded() in the case the zt->refs went to 0. This ensures a
correct post-loading routine to be called once the last load is done.
The `reload_status` is set to `NAMED_RELOAD_FAILED` after the log line is
printed about this change. Update `reload_status` first, to avoid
(unlikely) case where a test waiting for this log line would attempt a
RNDC reload query but it would be processed by `named` before the status
is updated.
Remove the exclusive mode when scheduling the zone load right after
(re)loading `named` configuration, as there is no reason anymore to
schedule zone loading while the exclusive lock is held. Data which can
be read or written by multiple threads are locked or atomic.
Catalog-zones can't be used in view which are not from the IN class.
This is now enforced as the server won't load (instead of loading
without the catalog-zone). This configuration error is now also caught
by `named-checkconf`.
The `configure_view()` `need_hints` is removed as it this function was
always called with the value `true`.
The `need_hints` wasn't even used in the function. The only thing it was
actually used was to throw a warning which can be done simply in an
`else` condition branch.
Moreoever, in the case of catalog zones and response-policy, it fixes a
possible bug that would affect root zones, as those wouldn't be reverted
back to their previous version in case of the view fails to load
(during a server reconfiguration).
Do not save the textual version of the effective configuration when
`allow-new-zones` is enabled, as it can be printed on-demand. This
enable to reduce the memory footprint of ~70MB on huge configurations
(1M zones).
the effective configuration tree is now detached if allow-new-zones
or catalog-zones aren't enabled in any views. this reduces memory
consumption while still allowing "rndc showconf -effective" to work.
as previously mentioned in commit c65b2868ab, a cfg_obj_t
configuration tree structure takes up considerably more space than
the canonical text. since the zone configuration saved in the zone
object using dns_zone_setcfg() is only currently used for "rndc
showzone", it can be saved as text more efficiently than as an
object tree. (and, if a tree were needed, the text could be
re-parsed quickly; zone configuration text is generally small.)
The built-in configuration is actually used in two cases: first, when
the server is loaded (or reloaded), and second when
'rndc showconf -builtin' is called.
Considering the parsing of the builtin configuration is quick and does
not occur during exclusive mode, but the configuration tree takes
considerable memory space, the built-in configuration is no longer kept
in memory once it has been used; instead it is re-parsed on demand.
once the user configuration has been merged into the effective
configuration, it no longer needs to be accessed as a configuration
tree, but we still want to be able to show it with 'rndc showconf -user'.
because the recursive strucure of cfg_obj objects is fairly large, the
canonical text form is a fraction of the size of the configuration
tree, so we now save it in that form instead.